[Inductor] Fix max_autotune BMM correctness with dynamic OpenMP threads by dumko2001 · Pull Request #169128 · pytorch/pytorch

dumko2001 · 2025-11-26T16:21:50Z

Summary

This PR fixes a silent correctness bug in max_autotune BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like cv2).

Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (tid = omp_get_thread_num()). If OpenMP provided fewer threads than requested, tasks for higher tids were simply skipped, leading to wrong results.

This change switches from a raw #pragma omp parallel block to #pragma omp parallel for. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count.
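The difference between the two patterns can be reproduced outside Inductor. The following is a minimal sketch, not the code the template emits: `run_thread_indexed`, `run_work_shared`, `count_done`, and `current_thread_id` are hypothetical names, and a smaller `team_size` stands in for OpenMP handing out fewer threads than there are work items. The `_OPENMP` guards let it compile with or without `-fopenmp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: the calling thread's id, or 0 when built without OpenMP.
static int current_thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Old pattern: each thread handles the work item matching its own id.
// If the team has fewer threads than num_items, the trailing items are skipped.
std::vector<int> run_thread_indexed(int64_t num_items, int team_size) {
    (void)team_size;  // only read by the pragma; unused without OpenMP
    std::vector<int> done(static_cast<std::size_t>(num_items), 0);
    #pragma omp parallel num_threads(team_size)
    {
        const int64_t tid = current_thread_id();
        if (tid < num_items) {
            done[tid] = 1;  // stand-in for the real per-item GEMM work
        }
    }
    return done;
}

// New pattern: the loop owns the work items, and OpenMP assigns every
// iteration to some thread regardless of how many threads it created.
std::vector<int> run_work_shared(int64_t num_items, int team_size) {
    (void)team_size;
    std::vector<int> done(static_cast<std::size_t>(num_items), 0);
    #pragma omp parallel for num_threads(team_size)
    for (int64_t tid = 0; tid < num_items; tid++) {
        done[tid] = 1;
    }
    return done;
}

// Counts how many work items were actually processed.
int64_t count_done(const std::vector<int>& done) {
    int64_t total = 0;
    for (int flag : done) {
        total += flag;
    }
    return total;
}
```

With 4 work items and a team of at most 2 threads, `run_thread_indexed(4, 2)` marks at most 2 items as done, while `run_work_shared(4, 2)` marks all 4.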

Test Plan

Verified that the generated C++ template in torch/_inductor/codegen/cpp_gemm_template.py now uses:

#pragma omp parallel for num_threads(...)
for (int64_t tid = 0; tid < ...; tid++) { ... }

This preserves existing memory barrier/cache behavior while fixing the task distribution logic.

Fixes
Fixes #168965

Performance Note

This change switches the scheduling model to omp parallel for. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations.
It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos

pytorch-bot · 2025-11-26T16:21:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169128

❌ 1 Cancelled Job, 6 Unrelated Failures

As of commit 8e45989 with merge base ebff479:

CANCELLED JOB - The following job was cancelled. Please retry:

trunk / macos-py3-arm64 / build (gh)
##[error]The operation was canceled.

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

trunk / linux-jammy-aarch64-py3.10 / test (default, 2, 5, lf.linux.arm64.m7g.4xlarge) (gh) (disabled by #136125, #137026, #137027 but the issue was closed recently and a rebase is needed to make it pass)
test/inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration
trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 4, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu) (gh) (disabled by #136125, #137026, #137027 but the issue was closed recently and a rebase is needed to make it pass)
test/inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

trunk / linux-jammy-rocm-py3.10-mi355 / test (default, 4, 6, linux.rocm.gpu.gfx950.1, unstable) (gh)
test/inductor/test_compiled_autograd.py::TestCompiledAutogradOpInfoCUDA::test_hops_in_bwd_inline_asm_elementwise_simple_cuda_float32
trunk / linux-jammy-rocm-py3.10-mi355 / test (default, 5, 6, linux.rocm.gpu.gfx950.1, unstable) (gh)
test/inductor/test_custom_op_autotune.py::TestCustomOpAutoTune::test_benchmark_with_cudagraphs_uses_cuda_graph_benchmarking
trunk / linux-jammy-rocm-py3.10-mi355 / test (distributed, 1, 3, linux.rocm.gpu.gfx950.2, unstable) (gh)
distributed/test_c10d_nccl.py::ProcessGroupNCCLGroupTest::test_resume
trunk / linux-jammy-rocm-py3.10-mi355 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.2, unstable) (gh)
distributed/test_c10d_nccl.py::ProcessGroupNCCLGroupTest::test_suspend

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire

Thanks for the fix! Can you add a unit test for this?

dumko2001 · 2025-12-01T19:03:03Z

@pytorchbot label "release notes: bug fix"

pytorch-bot · 2025-12-01T19:03:11Z

Didn't find following labels among repository labels: release notes: bug fix

dumko2001 · 2025-12-01T19:05:26Z

@desertfire I have added the unit test test_max_autotune_bmm_omp_dynamic in test/inductor/test_cpu_repro.py as requested.

The test uses ctypes to directly call omp_set_dynamic(1) from the loaded OpenMP library. This allows us to simulate the bug condition (where OpenMP provides fewer threads than requested) without needing to import heavy external libraries like cv2.

I verified the test logic locally to ensure it correctly detects the OpenMP library (or skips if missing), but I am relying on the CI to run the full end-to-end verification as I don't have a local build environment.
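The condition the test simulates can also be observed directly from C++. This is a rough sketch and not the unit test itself (which goes through `ctypes` from Python): `observed_team_size` is a hypothetical helper that enables dynamic adjustment, requests a team, and reports how many threads the runtime actually provided. Whether the runtime really hands out fewer threads depends on the implementation and on system load.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the team size OpenMP actually provides when `requested` threads are
// asked for while dynamic adjustment is enabled. Returns 1 without OpenMP.
int observed_team_size(int requested) {
    int team = 1;
#ifdef _OPENMP
    const int saved = omp_get_dynamic();
    omp_set_dynamic(1);  // allow the runtime to pick fewer threads than requested
    #pragma omp parallel num_threads(requested)
    {
        // One thread records the size of the team that was really created.
        #pragma omp single
        team = omp_get_num_threads();
    }
    omp_set_dynamic(saved);  // restore the caller's setting
#else
    (void)requested;
#endif
    return team;
}
```

Any kernel that assumes `observed_team_size(N) == N` is exposed to the bug described in this PR, since the returned value is only guaranteed to be at most `N`.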

dumko2001 · 2025-12-01T19:05:44Z

@pytorchbot label "release notes: bug fixes"

pytorch-bot · 2025-12-01T19:05:52Z

Didn't find following labels among repository labels: release notes: bug fixes

dumko2001 · 2025-12-01T19:07:10Z

@pytorchbot label "release notes: bug"

pytorch-bot · 2025-12-01T19:07:18Z

Didn't find following labels among repository labels: release notes: bug

dumko2001 · 2025-12-01T19:07:43Z

@pytorchbot label "topic: not user facing"

desertfire

To fix the lint error, please refer to https://github.com/pytorch/pytorch/wiki/lintrunner

dumko2001 · 2025-12-04T02:10:21Z

@desertfire,

I've isolated the core regression to the OpenMP parallelization structure in cpp_gemm_template.py.

The issue is that the logic inside the loop requires synchronization (likely a reduction/barrier for K-slicing) which is semantically invalid when wrapped in a single #pragma omp parallel for. This is causing the multiple CI failures (deadlocks/correctness errors).

The fix requires restructuring the parallel region. We need to switch to a model that allows for explicit synchronization, such as:

Option 1: Compute/Reduce Split (two sequential #pragma omp for blocks inside one #pragma omp parallel block).
Option 2: Single parallel for loop (if partitioning can be made purely thread-independent).

Can you advise which parallel structure (Split-Loop vs. Thread-Centric Parallel Region) is preferred for the Inductor codebase to handle this K-slicing reduction?
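To make the Compute/Reduce Split option concrete, here is a minimal sketch on a matrix-vector product rather than the real BMM kernel. `matvec_split_k` and all of its parameters are hypothetical names, and the layout (row-major `a`, one partial buffer per K slice) is an assumption for illustration. The implicit barrier at the end of the first `#pragma omp for` is what separates the compute phase from the reduce phase.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes y = A * x for a row-major m x k matrix, splitting K into slices.
std::vector<float> matvec_split_k(const std::vector<float>& a,
                                  const std::vector<float>& x,
                                  int64_t m, int64_t k,
                                  int64_t num_slices, int team_size) {
    (void)team_size;  // only read by the pragma; unused without OpenMP
    std::vector<float> partial(static_cast<std::size_t>(num_slices * m), 0.0f);
    std::vector<float> y(static_cast<std::size_t>(m), 0.0f);
    const int64_t slice_len = (k + num_slices - 1) / num_slices;

    #pragma omp parallel num_threads(team_size)
    {
        // Phase 1 (compute): every K slice is one loop iteration, so all
        // slices are processed however many threads the team really has.
        #pragma omp for
        for (int64_t s = 0; s < num_slices; s++) {
            const int64_t k_begin = s * slice_len;
            const int64_t k_end = std::min(k, k_begin + slice_len);
            for (int64_t row = 0; row < m; row++) {
                float acc = 0.0f;
                for (int64_t kk = k_begin; kk < k_end; kk++) {
                    acc += a[row * k + kk] * x[kk];
                }
                partial[s * m + row] = acc;
            }
        }
        // The implicit barrier at the end of the loop above guarantees that
        // all partial sums are written before the reduction starts.

        // Phase 2 (reduce): sum the per-slice partials for each output row.
        #pragma omp for
        for (int64_t row = 0; row < m; row++) {
            float acc = 0.0f;
            for (int64_t s = 0; s < num_slices; s++) {
                acc += partial[s * m + row];
            }
            y[row] = acc;
        }
    }
    return y;
}
```

Because both phases are worksharing loops, no iteration depends on a particular thread id existing, and no explicit barrier has to be placed inside a loop body.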

dumko2001 · 2025-12-04T12:41:06Z

@desertfire,

Following up on my question, after further research in aten/src/ATen/native/cpu/Reduce.h, I determined the Compute/Reduce Split pattern is the canonical and safest way to handle phased reductions/synchronization in PyTorch's CPU backend.

desertfire · 2025-12-04T15:53:39Z

cc @EikanWang

dumko2001 · 2025-12-04T16:41:08Z

@pytorchbot label "release notes: inductor"

pytorch-bot · 2025-12-04T16:41:32Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'remove' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

dumko2001 · 2025-12-04T16:42:38Z

@pytorchbot --help

pytorch-bot · 2025-12-04T16:42:40Z

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditing purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO, --into ONTO
                        Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

dumko2001 · 2025-12-04T16:45:48Z

@desertfire,

Could you please manually remove the topic: not user facing label?

The PR is a correctness fix that should be included in the release notes. Thank you!

dumko2001 · 2025-12-18T16:25:12Z

@desertfire Just checking in on this! Since the changes are approved, could you please approve the CI workflows? They are currently stuck on "awaiting approval," so the full test suite hasn't run yet.

Once CI is green, I think this is ready to merge. Thanks!

desertfire · 2026-01-06T18:25:39Z

@pytorchbot rebase

pytorchmergebot · 2026-01-06T18:33:14Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorch-bot · 2026-04-14T16:51:29Z

~~Workflows were awaiting approval.~~ CI has now been triggered for the ciflow labels on this PR.

pytorch-bot · 2026-04-14T16:51:30Z

The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:

ciflow/inductor
ciflow/torchtitan

Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.

dumko2001 · 2026-04-14T16:52:26Z

@desertfire I have rebased since lint was failing. Can you run CI again?

When omp_set_dynamic(1) is active (e.g. after importing cv2), OpenMP may spawn fewer threads than requested by num_threads(N). The old code used omp_get_thread_num() to index into per-thread work, so any thread ID that was never created left its slice of the output uncomputed, producing wrong results.

Replace the omp_get_thread_num() pattern with an explicit for (tid = 0; tid < N; tid++) loop annotated with #pragma omp for schedule(static, 1). OpenMP's work-sharing guarantees that every iteration runs regardless of how many threads are actually spawned, eliminating the correctness bug.

Also remove the unused DTYPE_TO_CPP import from cpp_grouped_gemm_template and add a regression test that exercises the bug by calling omp_set_dynamic(1) before running a compiled bmm.
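The effect of `schedule(static, 1)` on the `tid` loop can be checked with a small sketch. This is an illustration under assumptions, not the emitted kernel: `Assignment`, `run_static_chunks`, and `current_thread_id` are hypothetical names. With a static schedule and chunk size 1, OpenMP deals iterations to threads round-robin by thread number, so when the full team is created each `tid` lands on the thread with the same id, and when the team is smaller the remaining `tid` values wrap around instead of being dropped.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: the calling thread's id, or 0 when built without OpenMP.
static int current_thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Which thread ran each tid, plus the team size that was actually created.
struct Assignment {
    std::vector<int> owner;
    int team_size;
};

Assignment run_static_chunks(int64_t num_work_items, int requested_threads) {
    (void)requested_threads;  // only read by the pragma; unused without OpenMP
    Assignment result{
        std::vector<int>(static_cast<std::size_t>(num_work_items), -1), 1};
    #pragma omp parallel num_threads(requested_threads)
    {
#ifdef _OPENMP
        // One thread records the real team size; single ends with a barrier.
        #pragma omp single
        result.team_size = omp_get_num_threads();
#endif
        // Chunks of one iteration are assigned round-robin by thread number.
        #pragma omp for schedule(static, 1)
        for (int64_t tid = 0; tid < num_work_items; tid++) {
            result.owner[tid] = current_thread_id();  // every tid gets an owner
        }
    }
    return result;
}
```

After `run_static_chunks(8, 3)`, no entry of `owner` is left at `-1`, and `owner[tid]` equals `tid % team_size` for whatever `team_size` the runtime chose.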

jgong5 · 2026-04-27T01:01:23Z

@pytorchbot merge

pytorchmergebot · 2026-04-27T01:03:28Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2026-04-27T07:01:48Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jansel · 2026-05-08T23:26:16Z

@pytorchbot merge

pytorchmergebot · 2026-05-08T23:28:44Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2026-05-08T23:29:04Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build

jansel · 2026-05-09T00:28:12Z

@pytorchbot merge -i

pytorchmergebot · 2026-05-09T00:30:30Z

Merge started

Your change will be merged while ignoring the following 7 checks: trunk / macos-py3-arm64 / build, trunk / linux-jammy-aarch64-py3.10 / test (default, 2, 5, lf.linux.arm64.m7g.4xlarge), trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 4, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu), trunk / linux-jammy-rocm-py3.10-mi355 / test (default, 5, 6, linux.rocm.gpu.gfx950.1, unstable), trunk / linux-jammy-rocm-py3.10-mi355 / test (default, 4, 6, linux.rocm.gpu.gfx950.1, unstable), trunk / linux-jammy-rocm-py3.10-mi355 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.2, unstable), trunk / linux-jammy-rocm-py3.10-mi355 / test (distributed, 1, 3, linux.rocm.gpu.gfx950.2, unstable)

…ds (pytorch#169128)

# Summary

This PR fixes a silent correctness bug in `max_autotune` BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like `cv2`).

Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (`tid = omp_get_thread_num()`). If OpenMP provided fewer threads than requested, tasks for higher `tid`s were simply skipped, leading to wrong results.

This change switches from a raw `#pragma omp parallel` block to `#pragma omp parallel for`. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count.

# Test Plan

Verified that the generated C++ template in `torch/_inductor/codegen/cpp_gemm_template.py` now uses:

```cpp
#pragma omp parallel for num_threads(...)
for (int64_t tid = 0; tid < ...; tid++) { ... }
```

This preserves existing memory barrier/cache behavior while fixing the task distribution logic.

Fixes

Fixes pytorch#168965

# Performance Note

This change switches the scheduling model to `omp parallel for`. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations. It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression.

Pull Request resolved: pytorch#169128
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/mlazos, https://github.com/jansel

pytorch-bot Bot added the module: inductor label Nov 26, 2025

dumko2001 mentioned this pull request Nov 26, 2025

max_autotuned BMM produces wrong result when multiple threads are used #168965

Closed

pytorchbot added the open source label Nov 26, 2025

albanD mentioned this pull request Nov 26, 2025

Minimal, comprehensive test suite #167721

Open

desertfire reviewed Dec 1, 2025

pytorch-bot Bot added the topic: not user facing label Dec 1, 2025

desertfire approved these changes Dec 1, 2025

desertfire reviewed Dec 3, 2025

Comment thread test/inductor/test_cpu_repro.py Outdated

dumko2001 force-pushed the fix-issue-168965 branch from 605562f to 67b3db8 Compare December 3, 2025 18:42

pytorch-bot Bot added the release notes: inductor label Dec 4, 2025

pytorch-bot Bot added the ciflow/inductor and ciflow/torchtitan labels Apr 14, 2026

dumko2001 added 3 commits April 15, 2026 18:45

Fix nested OpenMP worksharing in GEMM template

c0a8789

[Inductor] Fix micro-gemm state scope in GEMM templates

8e45989

dumko2001 force-pushed the fix-issue-168965 branch from 416fd58 to 8e45989 Compare April 15, 2026 13:35

pytorch-bot Bot added the ciflow/trunk label Apr 27, 2026

pytorchmergebot added the merging label Apr 27, 2026

jgong5 approved these changes Apr 27, 2026

mlazos approved these changes Apr 27, 2026

jansel approved these changes May 8, 2026

pytorchmergebot removed the merging label May 8, 2026

pytorchmergebot added the merging label May 9, 2026

pytorchmergebot added the Merged label May 9, 2026

pytorchmergebot closed this in bc5fea3 May 9, 2026

pytorchmergebot removed the merging label May 9, 2026
