[Inductor] Fix max_autotune BMM correctness with dynamic OpenMP threads#169128
[Inductor] Fix max_autotune BMM correctness with dynamic OpenMP threads#169128dumko2001 wants to merge 3 commits into
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169128
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Cancelled Job, 6 Unrelated FailuresAs of commit 8e45989 with merge base ebff479 ( CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
desertfire
left a comment
There was a problem hiding this comment.
Thanks for the fix! Can you add a unit test for this?
|
@pytorchbot label "release notes: bug fix" |
|
Didn't find following labels among repository labels: release notes: bug fix |
|
@desertfire I have added the unit test test_max_autotune_bmm_omp_dynamic in test/inductor/test_cpu_repro.py as requested. The test uses ctypes to directly call omp_set_dynamic(1) from the loaded OpenMP library. This allows us to simulate the bug condition (where OpenMP provides fewer threads than requested) without needing to import heavy external libraries like cv2. I verified the test logic locally to ensure it correctly detects the OpenMP library (or skips if missing), but I am relying on the CI to run the full end-to-end verification as I don't have a local build environment. |
|
@pytorchbot label "release notes: bug fixes" |
|
Didn't find following labels among repository labels: release notes: bug fixes |
|
@pytorchbot label "release notes: bug" |
|
Didn't find following labels among repository labels: release notes: bug |
|
@pytorchbot label "topic: not user facing" |
desertfire
left a comment
There was a problem hiding this comment.
To fix lint error, please refer to https://github.com/pytorch/pytorch/wiki/lintrunner
605562f to
67b3db8
Compare
|
I've isolated the core regression to the OpenMP parallelization structure in cpp_gemm_template.py. The issue is that the logic inside the loop requires synchronization (likely a reduction/barrier for K-slicing) which is semantically invalid when wrapped in a single #pragma omp parallel for. This is causing the multiple CI failures (deadlocks/correctness errors). The fix requires a structural restructuring of the parallel region. We need to switch to a model that allows for explicit synchronization, such as: Option: Compute/Reduce Split (Two sequential #pragma omp for blocks inside one #pragma omp parallel block). |
|
Following up on my question, after further research in aten/src/ATen/native/cpu/Reduce.h, I determined the Compute/Reduce Split pattern is the canonical and safest way to handle phased reductions/synchronization in PyTorch's CPU backend. |
|
cc @EikanWang |
|
@pytorchbot label "release notes: inductor" |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot --help |
PyTorchBot HelpMergeRevertRebaseLabelDr CIcherry-pick |
|
Could you please manually remove the topic: not user facing label? The PR is a correctness fix that should be included in the release notes, Thank you! |
|
@desertfire Just checking in on this! Since the changes are approved, could you please approve the CI workflows? They are currently stuck on "awaiting approval," so the full test suite hasn't run yet. Once CI is green, I think this is ready to merge. Thanks! |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
|
|
The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:
Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows. |
|
@desertfire I have rebased since lint was failing. can u run ci again |
When omp_set_dynamic(1) is active (e.g. after importing cv2), OpenMP may spawn fewer threads than requested by num_threads(N). The old code used omp_get_thread_num() to index into per-thread work, so any thread ID that was never created left its slice of the output uncomputed, producing wrong results. Replace the omp_get_thread_num() pattern with #pragma omp for schedule(static, 1) inside an explicit for(tid=0; tid<N; tid++) loop. OpenMP's work-sharing guarantees that every iteration runs regardless of how many actual threads are spawned, eliminating the correctness bug. Also remove the unused DTYPE_TO_CPP import from cpp_grouped_gemm_template and add a regression test that exercises the bug by calling omp_set_dynamic(1) before running a compiled bmm.
416fd58 to
8e45989
Compare
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 7 checks: trunk / macos-py3-arm64 / build, trunk / linux-jammy-aarch64-py3.10 / test (default, 2, 5, lf.linux.arm64.m7g.4xlarge), trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 4, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu), trunk / linux-jammy-rocm-py3.10-mi355 / test (default, 5, 6, linux.rocm.gpu.gfx950.1, unstable), trunk / linux-jammy-rocm-py3.10-mi355 / test (default, 4, 6, linux.rocm.gpu.gfx950.1, unstable), trunk / linux-jammy-rocm-py3.10-mi355 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.2, unstable), trunk / linux-jammy-rocm-py3.10-mi355 / test (distributed, 1, 3, linux.rocm.gpu.gfx950.2, unstable) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ds (pytorch#169128) # Summary This PR fixes a silent correctness bug in `max_autotune` BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like `cv2`). Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (`tid = omp_get_thread_num()`). If OpenMP provided fewer threads than requested, tasks for higher `tid`s were simply skipped, leading to wrong results. This change switches from a raw `#pragma omp parallel` block to `#pragma omp parallel for`. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count. # Test Plan Verified that the generated C++ template in `torch/_inductor/codegen/cpp_gemm_template.py` now uses: ```cpp #pragma omp parallel for num_threads(...) for (int64_t tid = 0; tid < ...; tid++) { ... } This preserves existing memory barrier/cache behavior while fixing the task distribution logic. ``` Fixes Fixes pytorch#168965 # Performance Note This change switches the scheduling model to `omp parallel for`. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations. It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression. Pull Request resolved: pytorch#169128 Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/mlazos, https://github.com/jansel
…ds (pytorch#169128) # Summary This PR fixes a silent correctness bug in `max_autotune` BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like `cv2`). Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (`tid = omp_get_thread_num()`). If OpenMP provided fewer threads than requested, tasks for higher `tid`s were simply skipped, leading to wrong results. This change switches from a raw `#pragma omp parallel` block to `#pragma omp parallel for`. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count. # Test Plan Verified that the generated C++ template in `torch/_inductor/codegen/cpp_gemm_template.py` now uses: ```cpp #pragma omp parallel for num_threads(...) for (int64_t tid = 0; tid < ...; tid++) { ... } This preserves existing memory barrier/cache behavior while fixing the task distribution logic. ``` Fixes Fixes pytorch#168965 # Performance Note This change switches the scheduling model to `omp parallel for`. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations. It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression. Pull Request resolved: pytorch#169128 Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/mlazos, https://github.com/jansel
Summary
This PR fixes a silent correctness bug in
max_autotuneBMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries likecv2).Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (
tid = omp_get_thread_num()). If OpenMP provided fewer threads than requested, tasks for highertids were simply skipped, leading to wrong results.This change switches from a raw
#pragma omp parallelblock to#pragma omp parallel for. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count.Test Plan
Verified that the generated C++ template in
torch/_inductor/codegen/cpp_gemm_template.pynow uses:Fixes
Fixes #168965
Performance Note
This change switches the scheduling model to
omp parallel for. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations.It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos