max_autotuned BMM produces wrong result when multiple threads are used #168965

@mstebelev

🐛 Describe the bug

I noticed that when I use aoti_compile_and_package with max_autotune, the result is wrong under certain conditions. Specifically:

  1. Calling set_num_threads(4) is required. With 1 thread the bug doesn't reproduce.
  2. Importing cv2 is required. Without it the bug doesn't reproduce.
  3. Setting os.environ['OPENCV_FOR_OPENMP_DYNAMIC_DISABLE'] = '1' before the import fixes the issue.
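The ordering in item 3 matters, so here is a minimal sketch of the workaround. It assumes only what is stated above: the environment variable name comes from this report, and the guarded imports are there so the snippet still runs on machines without OpenCV or PyTorch installed.

```python
import os

# Must be set BEFORE cv2 is imported: per this report, omp_set_dynamic is
# called while the OpenCV shared library is being loaded.
os.environ["OPENCV_FOR_OPENMP_DYNAMIC_DISABLE"] = "1"

try:
    import cv2  # noqa: F401  (imported only for its load-time side effects)
except ImportError:
    cv2 = None  # OpenCV is not installed, so there is nothing to work around

try:
    import torch
    torch.set_num_threads(4)  # the thread count used in this report
except ImportError:
    torch = None  # PyTorch is not installed in this environment
```

Setting the variable after the import would be too late, since the library's load-time initialization would already have run.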

My explanation of this behavior is that the code produced by max_autotune looks like this:

void cpp_CppMicroGemmFP32Vec_threaded_mm(const float* X, const float* W, float* Y, const int64_t ks_b_index)
...
    #pragma omp parallel num_threads(4)
    {
        
        const int tid = omp_get_thread_num();
        const int64_t k_group_id = tid / num_Kt_blocks;
        const int64_t k_slice_id = tid % num_Kt_blocks;
...

and the code relies on this block really being executed by 4 threads in parallel. But if omp_set_dynamic has been called, OpenMP may ignore this thread-count hint and run the block with fewer threads, which leads to wrong results; this behavior is documented here. Unfortunately, omp_set_dynamic is called while I'm importing the cv2 library, specifically here, just when the shared library is loaded.
So I think this should be fixed so that the generated code does not depend on this kind of OpenMP behavior. Maybe it should even use at::parallel_for instead, because different parallelization backends can be enabled, not necessarily OpenMP.
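To illustrate the failure mode described above, here is a pure-Python simulation of a parallel region. It is not the generated kernel: run_team and its parameters are hypothetical names, and each loop iteration stands in for one OpenMP thread. The "fragile" mode derives the work slice from tid alone, as in the generated code; the "robust" mode strides over the logical slices by the actual team size.

```python
def run_team(num_slices, team_size, robust):
    """Simulate a parallel region planned for `num_slices` threads
    in which only `team_size` threads actually run."""
    done = [False] * num_slices
    for tid in range(team_size):  # each iteration plays one thread of the team
        if robust:
            # Stride over logical slices by the actual team size, so every
            # slice is covered even when the team is smaller than planned.
            for s in range(tid, num_slices, team_size):
                done[s] = True
        else:
            # Mirrors the pattern in the report: the slice comes from tid only.
            done[tid] = True
    return done


fragile = run_team(4, 2, robust=False)    # slices 2 and 3 are never computed
robust_run = run_team(4, 2, robust=True)  # all four slices are computed
print(fragile, robust_run)
```

With a full team of 4 both modes cover every slice, which matches the observation that the bug only shows up when the runtime delivers fewer threads than requested. A real fix may need more than striding (for example, if the kernel synchronizes between slices), so treat this only as a picture of the invariant.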

This notebook should reproduce the bug, but I didn't manage to reproduce it in Colab, because max_autotune chooses a different implementation there and the PyTorch version is also different.

data.zip

On PyTorch 2.9 it doesn't reproduce, but I noticed that the generated code uses different constants. Maybe the layout of the input tensors in BMM has changed, so the bug isn't triggered, but the code still relies on the invariant that the number of threads actually executing the block equals N in #pragma omp parallel num_threads(N).

Error logs

No response

Versions

Collecting environment information...
PyTorch version: 2.7.0
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-134-generic-aarch64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40S
Nvidia driver version: 550.127.05
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         aarch64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            ARM
Model name:                           Neoverse-N1
Model:                                1
Thread(s) per core:                   1
Core(s) per cluster:                  128
Socket(s):                            -
Cluster(s):                           1
Stepping:                             r3p1
Frequency boost:                      disabled
CPU(s) scaling MHz:                   41%
CPU max MHz:                          3000.0000
CPU min MHz:                          1000.0000
BogoMIPS:                             50.00
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
L1d cache:                            8 MiB (128 instances)
L1i cache:                            8 MiB (128 instances)
L2 cache:                             128 MiB (128 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; CSV2, BHB
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] torch==2.7.0
[conda] Could not collect

cc @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @yushangdi @benjaminglass1 @jataylo @iupaikov-amd

Metadata

    Labels

    module: aotinductor, module: correctness (silent), oncall: cpu inductor, oncall: export, oncall: pt2, triaged
