Skip to content

[Inductor] Fix max_autotune BMM correctness with dynamic OpenMP threads#169128

Closed
dumko2001 wants to merge 3 commits into
pytorch:mainfrom
dumko2001:fix-issue-168965
Closed

[Inductor] Fix max_autotune BMM correctness with dynamic OpenMP threads#169128
dumko2001 wants to merge 3 commits into
pytorch:mainfrom
dumko2001:fix-issue-168965

Conversation

@dumko2001

@dumko2001 dumko2001 commented Nov 26, 2025

Copy link
Copy Markdown
Contributor

Summary

This PR fixes a silent correctness bug in max_autotune BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like cv2).

Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (tid = omp_get_thread_num()). If OpenMP provided fewer threads than requested, tasks for higher tids were simply skipped, leading to wrong results.

This change switches from a raw #pragma omp parallel block to #pragma omp parallel for. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count.

Test Plan

Verified that the generated C++ template in torch/_inductor/codegen/cpp_gemm_template.py now uses:

#pragma omp parallel for num_threads(...)
for (int64_t tid = 0; tid < ...; tid++) { ... }
This preserves existing memory barrier/cache behavior while fixing the task distribution logic.

Fixes
Fixes #168965

Performance Note

This change switches the scheduling model to omp parallel for. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations.
It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos

@pytorch-bot

pytorch-bot Bot commented Nov 26, 2025

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169128

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 6 Unrelated Failures

As of commit 8e45989 with merge base ebff479 (image):

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@desertfire desertfire left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! Can you add a unit test for this?

@dumko2001

Copy link
Copy Markdown
Contributor Author

@pytorchbot label "release notes: bug fix"

@pytorch-bot

pytorch-bot Bot commented Dec 1, 2025

Copy link
Copy Markdown

Didn't find following labels among repository labels: release notes: bug fix

@dumko2001

Copy link
Copy Markdown
Contributor Author

@desertfire I have added the unit test test_max_autotune_bmm_omp_dynamic in test/inductor/test_cpu_repro.py as requested.

The test uses ctypes to directly call omp_set_dynamic(1) from the loaded OpenMP library. This allows us to simulate the bug condition (where OpenMP provides fewer threads than requested) without needing to import heavy external libraries like cv2.

I verified the test logic locally to ensure it correctly detects the OpenMP library (or skips if missing), but I am relying on the CI to run the full end-to-end verification as I don't have a local build environment.

@dumko2001

Copy link
Copy Markdown
Contributor Author

@pytorchbot label "release notes: bug fixes"

@pytorch-bot

pytorch-bot Bot commented Dec 1, 2025

Copy link
Copy Markdown

Didn't find following labels among repository labels: release notes: bug fixes

@dumko2001

Copy link
Copy Markdown
Contributor Author

@pytorchbot label "release notes: bug"

@pytorch-bot

pytorch-bot Bot commented Dec 1, 2025

Copy link
Copy Markdown

Didn't find following labels among repository labels: release notes: bug

@dumko2001

Copy link
Copy Markdown
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot Bot added the topic: not user facing topic category label Dec 1, 2025

@desertfire desertfire left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix lint error, please refer to https://github.com/pytorch/pytorch/wiki/lintrunner

Comment thread test/inductor/test_cpu_repro.py Outdated
@dumko2001

Copy link
Copy Markdown
Contributor Author

@desertfire,

I've isolated the core regression to the OpenMP parallelization structure in cpp_gemm_template.py.

The issue is that the logic inside the loop requires synchronization (likely a reduction/barrier for K-slicing) which is semantically invalid when wrapped in a single #pragma omp parallel for. This is causing the multiple CI failures (deadlocks/correctness errors).

The fix requires a structural restructuring of the parallel region. We need to switch to a model that allows for explicit synchronization, such as:

Option: Compute/Reduce Split (Two sequential #pragma omp for blocks inside one #pragma omp parallel block).
Option: Single parallel for Loop (if partitioning can be made purely thread-independent).
Can you advise which parallel structure (Split-Loop vs. Thread-Centric Parallel Region) is preferred for the Inductor codebase to handle this K-slicing reduction?

@dumko2001

Copy link
Copy Markdown
Contributor Author

@desertfire,

Following up on my question, after further research in aten/src/ATen/native/cpu/Reduce.h, I determined the Compute/Reduce Split pattern is the canonical and safest way to handle phased reductions/synchronization in PyTorch's CPU backend.

@desertfire

Copy link
Copy Markdown
Contributor

cc @EikanWang

@dumko2001

Copy link
Copy Markdown
Contributor Author

@pytorchbot label "release notes: inductor"

@pytorch-bot

pytorch-bot Bot commented Dec 4, 2025

Copy link
Copy Markdown

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'remove' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@dumko2001

Copy link
Copy Markdown
Contributor Author

@pytorchbot --help

@pytorch-bot

pytorch-bot Bot commented Dec 4, 2025

Copy link
Copy Markdown

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditting purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO, --into ONTO
                        Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

@dumko2001

Copy link
Copy Markdown
Contributor Author

@desertfire,

Could you please manually remove the topic: not user facing label?

The PR is a correctness fix that should be included in the release notes, Thank you!

@dumko2001

Copy link
Copy Markdown
Contributor Author

@desertfire Just checking in on this! Since the changes are approved, could you please approve the CI workflows? They are currently stuck on "awaiting approval," so the full test suite hasn't run yet.

Once CI is green, I think this is ready to merge. Thanks!

@desertfire

Copy link
Copy Markdown
Contributor

@pytorchbot rebase

@pytorchmergebot

Copy link
Copy Markdown
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorch-bot pytorch-bot Bot added ciflow/inductor ciflow/torchtitan Run TorchTitan integration tests labels Apr 14, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 14, 2026

Copy link
Copy Markdown

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

@pytorch-bot

pytorch-bot Bot commented Apr 14, 2026

Copy link
Copy Markdown

The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:

  • ciflow/inductor
  • ciflow/torchtitan

Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.

@dumko2001

Copy link
Copy Markdown
Contributor Author

@desertfire I have rebased since lint was failing. can u run ci again

When omp_set_dynamic(1) is active (e.g. after importing cv2), OpenMP may
spawn fewer threads than requested by num_threads(N).  The old code used
omp_get_thread_num() to index into per-thread work, so any thread ID that
was never created left its slice of the output uncomputed, producing wrong
results.

Replace the omp_get_thread_num() pattern with #pragma omp for
schedule(static, 1) inside an explicit for(tid=0; tid<N; tid++) loop.
OpenMP's work-sharing guarantees that every iteration runs regardless of
how many actual threads are spawned, eliminating the correctness bug.

Also remove the unused DTYPE_TO_CPP import from cpp_grouped_gemm_template
and add a regression test that exercises the bug by calling
omp_set_dynamic(1) before running a compiled bmm.
@jgong5

jgong5 commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 27, 2026
@pytorchmergebot

Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot

Copy link
Copy Markdown
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@jansel

jansel commented May 8, 2026

Copy link
Copy Markdown
Contributor

@pytorchbot merge

@pytorchmergebot

Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot

Copy link
Copy Markdown
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build

Details for Dev Infra team Raised by workflow job

@jansel

jansel commented May 9, 2026

Copy link
Copy Markdown
Contributor

@pytorchbot merge -i

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request May 13, 2026
…ds (pytorch#169128)

# Summary
This PR fixes a silent correctness bug in `max_autotune` BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like `cv2`).

Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (`tid = omp_get_thread_num()`). If OpenMP provided fewer threads than requested, tasks for higher `tid`s were simply skipped, leading to wrong results.

This change switches from a raw `#pragma omp parallel` block to `#pragma omp parallel for`. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count.

# Test Plan
Verified that the generated C++ template in `torch/_inductor/codegen/cpp_gemm_template.py` now uses:
```cpp
#pragma omp parallel for num_threads(...)
for (int64_t tid = 0; tid < ...; tid++) { ... }
This preserves existing memory barrier/cache behavior while fixing the task distribution logic.
```
Fixes
Fixes pytorch#168965

# Performance Note
This change switches the scheduling model to `omp parallel for`. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations.
It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression.

Pull Request resolved: pytorch#169128
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/mlazos, https://github.com/jansel
Alokksinha00 pushed a commit to Alokksinha00/pytorch that referenced this pull request May 15, 2026
…ds (pytorch#169128)

# Summary
This PR fixes a silent correctness bug in `max_autotune` BMM kernels when the OpenMP thread count is dynamic (e.g., reduced by external libraries like `cv2`).

Previously, the generated C++ code assumed a 1-to-1 mapping between physical threads and work items (`tid = omp_get_thread_num()`). If OpenMP provided fewer threads than requested, tasks for higher `tid`s were simply skipped, leading to wrong results.

This change switches from a raw `#pragma omp parallel` block to `#pragma omp parallel for`. This ensures OpenMP automatically distributes the loop iterations (work items) across available threads, guaranteeing all work is computed regardless of the actual thread count.

# Test Plan
Verified that the generated C++ template in `torch/_inductor/codegen/cpp_gemm_template.py` now uses:
```cpp
#pragma omp parallel for num_threads(...)
for (int64_t tid = 0; tid < ...; tid++) { ... }
This preserves existing memory barrier/cache behavior while fixing the task distribution logic.
```
Fixes
Fixes pytorch#168965

# Performance Note
This change switches the scheduling model to `omp parallel for`. While this fixes the correctness bug, I could not verify if this introduces any scheduling overhead on standard benchmarks due to hardware limitations.
It would be helpful if a reviewer could run a standard Inductor benchmark to ensure no performance regression.

Pull Request resolved: pytorch#169128
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/mlazos, https://github.com/jansel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/torchtitan Run TorchTitan integration tests ciflow/trunk Trigger trunk jobs on your pull request Merged module: inductor open source release notes: inductor topic: bug fixes topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

max_autotuned BMM produces wrong result when multiple threads are used

8 participants