max_autotuned BMM produces wrong result when multiple threads are used

### 🐛 Describe the bug

I noticed that when I use `aoti_compile_and_package` with max_autotune, the result is wrong under certain conditions. Specifically:
1. It's important to call `set_num_threads(4)`. With 1 thread the bug doesn't reproduce.
2. It's important to do `import cv2`. Without it the bug doesn't reproduce.
3. Adding `os.environ['OPENCV_FOR_OPENMP_DYNAMIC_DISABLE'] = '1'` before the import fixes the issue.
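
The workaround in item 3 can be sketched as below. The helper name `disable_opencv_dynamic_openmp` is hypothetical; only the environment variable comes from this report. The sketch assumes, as described here, that OpenCV reads the variable when its shared library is loaded, so it has to be set before the first `import cv2` in the process.

```python
import os
import sys


def disable_opencv_dynamic_openmp() -> bool:
    """Set OPENCV_FOR_OPENMP_DYNAMIC_DISABLE before cv2 is imported.

    Hypothetical helper, not part of PyTorch or OpenCV.
    Returns True if the variable was set while cv2 was not yet imported,
    and False if cv2 is already loaded (assumed too late to matter).
    """
    if "cv2" in sys.modules:
        return False
    os.environ["OPENCV_FOR_OPENMP_DYNAMIC_DISABLE"] = "1"
    return True


# Call this before the first `import cv2`.
applied = disable_opencv_dynamic_openmp()
```

If `applied` is `False`, the import order needs to change so that the helper runs first.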

My explanation of this behavior is that the code produced by max_autotune looks like this:
```cpp
void cpp_CppMicroGemmFP32Vec_threaded_mm(const float* X, const float* W, float* Y, const int64_t ks_b_index)
...
    #pragma omp parallel num_threads(4)
    {
        
        const int tid = omp_get_thread_num();
        const int64_t k_group_id = tid / num_Kt_blocks;
        const int64_t k_slice_id = tid % num_Kt_blocks;
...
```
and the code relies on this block actually being executed by 4 threads in parallel. But if `omp_set_dynamic` has been called to enable dynamic adjustment, OpenMP may ignore the `num_threads` hint and run the region with fewer threads, which leads to wrong results. This behavior is documented [here](https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-860002.6.1). Unfortunately, `omp_set_dynamic` is called when I import the `cv2` library, specifically [here](https://github.com/opencv/opencv/blob/4.x/modules/core/src/parallel.cpp#L470), as soon as the shared library is loaded.
So I think the generated code should be fixed so that it doesn't depend on this OpenMP behavior. It might even use `at::parallel_for` instead, because a different parallel backend can be enabled, not necessarily OpenMP.
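
To make the failure mode concrete, here is a small Python model. It is not the generated C++ and does not use real OpenMP. It assumes, as the snippet above suggests, that each thread id owns exactly one slice of the work. The names `run_region`, `one_slice_per_tid`, and `strided_slices` are hypothetical. When the runtime delivers fewer threads than requested, the first scheme silently skips slices, while a scheme that strides over the slices by the actual team size still covers all of them.

```python
def run_region(body, actual_threads):
    """Model a parallel region: `body` runs once per thread actually delivered.

    With dynamic adjustment enabled, actual_threads may be smaller than the
    number of threads the code asked for.
    """
    for tid in range(actual_threads):
        body(tid, actual_threads)


def one_slice_per_tid(values, requested_threads, actual_threads):
    """Fragile scheme: slice i is processed only by the thread with tid == i."""
    chunk = (len(values) + requested_threads - 1) // requested_threads
    partial = [0.0] * requested_threads

    def body(tid, _team_size):
        partial[tid] = sum(values[tid * chunk:(tid + 1) * chunk])

    run_region(body, actual_threads)
    return sum(partial)


def strided_slices(values, requested_threads, actual_threads):
    """Robust scheme: each thread loops over slices, stepping by the real team size."""
    chunk = (len(values) + requested_threads - 1) // requested_threads
    partial = [0.0] * requested_threads

    def body(tid, team_size):
        for s in range(tid, requested_threads, team_size):
            partial[s] = sum(values[s * chunk:(s + 1) * chunk])

    run_region(body, actual_threads)
    return sum(partial)


values = [float(i) for i in range(1, 9)]  # true sum is 36.0

print(one_slice_per_tid(values, 4, 4))  # 36.0: all 4 threads delivered
print(one_slice_per_tid(values, 4, 2))  # 10.0: slices 2 and 3 were never processed
print(strided_slices(values, 4, 2))     # 36.0: correct with a smaller team
```

In real OpenMP code the team size seen inside the region is available from `omp_get_num_threads()`, so one possible fix along these lines would be to loop over logical work items rather than assume one item per thread. Whether that fits the generated kernel is not something this sketch establishes.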

[This](https://colab.research.google.com/drive/1fDz0ZcDbYhluSTQ-ldPcZebS65YPP5KX?usp=sharing) notebook should reproduce the bug, but I didn't manage to reproduce it in Colab, because max_autotune chooses a different implementation there and the PyTorch version is also different.

[data.zip](https://github.com/user-attachments/files/23722728/data.zip)

On PyTorch 2.9 it doesn't reproduce, but I noticed that the generated code uses different constants. Maybe the layout of the input tensors in BMM has changed, so the bug isn't triggered. Either way, the code still relies on the invariant that the number of threads actually executing the region equals `N` in `#pragma omp parallel num_threads(N)`.

### Error logs

_No response_

### Versions

```
Collecting environment information...
PyTorch version: 2.7.0
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-134-generic-aarch64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40S
Nvidia driver version: 550.127.05
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         aarch64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            ARM
Model name:                           Neoverse-N1
Model:                                1
Thread(s) per core:                   1
Core(s) per cluster:                  128
Socket(s):                            -
Cluster(s):                           1
Stepping:                             r3p1
Frequency boost:                      disabled
CPU(s) scaling MHz:                   41%
CPU max MHz:                          3000.0000
CPU min MHz:                          1000.0000
BogoMIPS:                             50.00
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
L1d cache:                            8 MiB (128 instances)
L1i cache:                            8 MiB (128 instances)
L2 cache:                             128 MiB (128 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; CSV2, BHB
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] torch==2.7.0
[conda] Could not collect
```

cc @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @yushangdi @benjaminglass1 @jataylo @iupaikov-amd

max_autotuned BMM produces wrong result when multiple threads are used #168965
