use two pass reduction for deterministic reduction order by zhuhaozhe · Pull Request #115620 · pytorch/pytorch

zhuhaozhe · 2023-12-12T09:40:56Z

Motivation

Address the non-deterministic reduction order issue for omp parallel reduction.

Latest update on 1.15:

55d8190.
Do not reduce to arr in loops. Instead, reduce to a local scaler and write it to arr after local reduction is done. This will allow the compiler to optimize the reduction variable in register instead read/write from memory. If the working set of loop body is quite large, read/write from register/memory will have a large gap.

vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)

Examples code:

tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // access array will always from memory
    }
}

will be changed to

tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}

Descriptions

Following aten to use two pass reduction with omp parallel for deterministic reduction order.

pytorch/aten/src/ATen/Parallel-inl.h

Line 39 in 9c3ae37

inline scalar_t parallel_reduce(

pytorch/aten/src/ATen/native/TensorIteratorReduce.cpp

Line 24 in 9c3ae37

void TensorIteratorBase::parallel_reduce(loop2d_t loop) {

            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++) 
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);

Test results

I test this PR with dynamo benchmark on 32-core ICX system,
Result (avg speed up):

	before this PR	after this PR
torchbench	1.303	1.301
hugginface	1.346	1.343
timms	1.971	1.970

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)    
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done

System info

ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Stack from ghstack (oldest at bottom):

-> use two pass reduction for deterministic reduction order #115620

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot · 2023-12-12T09:40:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115620

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d3933e6 with merge base 96ed37a ():

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu) (gh)
distributed/checkpoint/test_state_dict.py::TestStateDict::test_compiled_fsdp
pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge) (gh)
npm ERR! code EEXIST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

ghstack-source-id: 383973b Pull Request resolved: #115620

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

ghstack-source-id: 625ff4d Pull Request resolved: #115620

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

ghstack-source-id: c74f680 Pull Request resolved: #115620

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

ghstack-source-id: 409ffaf Pull Request resolved: #115620

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

zhuhaozhe · 2023-12-15T01:30:55Z

@pytorchbot rebase

pytorchmergebot · 2023-12-15T01:32:37Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

pytorchmergebot · 2023-12-15T01:32:56Z

Successfully rebased gh/zhuhaozhe/4/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/115620)

ghstack-source-id: 0b92585 Pull Request resolved: #115620

zhuhaozhe · 2023-12-16T13:53:29Z

@pytorchbot rebase

pytorchmergebot · 2023-12-16T13:55:12Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

pytorchmergebot · 2023-12-16T13:55:26Z

Successfully rebased gh/zhuhaozhe/4/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/115620)

zhuhaozhe · 2024-02-20T00:44:43Z

@pytorchbot merge

pytorchmergebot · 2024-02-20T00:46:28Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

jeffdaily · 2024-02-20T17:17:39Z

This PR seems to have broken CUDA and ROCm CI jobs.

https://hud.pytorch.org/failure?name=pull%20%2F%20linux-focal-cuda12.1-py3.10-gcc9-sm86%20%2F%20test%20(default%2C%202%2C%205%2C%20linux.g5.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22inductor%2Ftest_torchinductor_dynamic_shapes.py%3A%3ATestInductorDynamicCUDA%3A%3Atest_full_cuda%22%5D

jeffdaily · 2024-02-20T17:20:35Z

@pytorchbot revert

pytorch-bot · 2024-02-20T17:20:39Z

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

jeffdaily · 2024-02-20T17:23:17Z

@pytorchbot revert -m "This PR seems to have broken CUDA and ROCm CI jobs; inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_full_cuda" -c landrace

pytorchmergebot · 2024-02-20T17:24:53Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2024-02-20T17:25:02Z

Reverting PR 115620 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit cc7ef43423fe36cf1778a9c9643454d62050a5b5 returned non-zero exit code 1

Auto-merging test/inductor/test_cpu_repro.py
CONFLICT (content): Merge conflict in test/inductor/test_cpu_repro.py
Auto-merging torch/_inductor/codegen/cpp.py
error: could not revert cc7ef43423f... use two pass reduction for deterministic reduction order (#115620)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".

Details for Dev Infra team

Raised by workflow job

…orch#115620)" This reverts commit cc7ef43.

) ## Motivation Address the [non-deterministic reduction order](pytorch#93542 (comment)) issue for `omp parallel reduction`. ## Latest update on 1.15: pytorch@55d8190. Do not reduce to arr in loops. Instead, reduce to a local scaler and write it to arr after local reduction is done. This will allow the compiler to optimize the reduction variable in register instead read/write from memory. If the `working set` of `loop body` is quite large, `read/write from register/memory` will have a large gap. ``` vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0 vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4) ``` Examples code: ``` tmp0_acc_arr[64]; #pragma omp parallel num_threads(64) { auto tid = omp_get_thread_num(); #pragma omp for for(...){ .... tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x; // access array will always from memory } } ``` will be changed to ``` tmp0_acc_arr[64]; #pragma omp parallel num_threads(64) { auto tid = omp_get_thread_num(); **auto tmp0_acc_local = 0;** #pragma omp for for(...){ .... **tmp0_acc_local** = tmp0_acc_local + tmp_x; } **tmp0_acc_arr[tid] = tmp0_acc_local;** } ``` ## Descriptions Following aten to use `two pass reduction` with `omp parallel` for deterministic reduction order. https://github.com/pytorch/pytorch/blob/9c3ae37fc453505f5e437d1edadefdb278c2c39c/aten/src/ATen/Parallel-inl.h#L39 https://github.com/pytorch/pytorch/blob/9c3ae37fc453505f5e437d1edadefdb278c2c39c/aten/src/ATen/native/TensorIteratorReduce.cpp#L24 ``` float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); // init reduction buffer per thread float tmp_acc0_arr[64]; at::vec::Vectorized<float> tmp_acc0_vec_arr[64]; for (int tid = 0; tid < 64; tid++) { tmp_acc0_arr[tid] = 0; tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0); } #pragma omp parallel num_threads(64) { int tid = omp_get_thread_num(); #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0)); auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0)); auto tmp2 = tmp0 - tmp1; auto tmp3 = tmp2 * tmp2; // reduce to per thread buffers tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3; } } // second pass reduce for (int tid = 0; tid < 64; tid++) { tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid]; tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid]; } tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec); out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0); ``` ## Test results I test this PR with dynamo benchmark on 32-core ICX system, Result (avg speed up): | | before this PR | after this PR | | ---- | ---- | ---- | | torchbench | 1.303 | 1.301 | | hugginface | 1.346 | 1.343 | | timms | 1.971 | 1.970 | ``` export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1" export KMP_AFFINITY=granularity=fine,compact,1,0 export KMP_BLOCKTIME=1 multi_threads_test() { CORES=$(lscpu | grep Core | awk '{print $4}') export OMP_NUM_THREADS=$CORES end_core=$(expr $CORES - 1) numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv } SCENARIO=performance DT=float32 export TORCHINDUCTOR_FREEZING=1 Flag_extra="--freezing" Mode_extra="--inference" for suite in timm_models huggingface torchbench do export SUITE=$suite echo $SUITE export LOG_BASE=`date +%m%d%H%M%S` mkdir $LOG_BASE multi_threads_test done ``` System info ``` ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 1 Stepping: 6 BogoMIPS: 5800.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities Virtualization features: Hypervisor vendor: KVM Virtualization type: full Caches (sum of all): L1d: 1.5 MiB (32 instances) L1i: 1 MiB (32 instances) L2: 40 MiB (32 instances) L3: 54 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-63 Vulnerabilities: Gather data sampling: Unknown: Dependent on hypervisor status Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown Retbleed: Not affected Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Srbds: Not affected Tsx async abort: Not affected ``` Pull Request resolved: pytorch#115620 Approved by: https://github.com/jgong5, https://github.com/jansel

zhuhaozhe · 2024-02-27T06:56:02Z

Hi, @jeffdaily. Thanks for the reminder of CI failures.
However, I have some questions about https://hud.pytorch.org/failure?name=pull%20%2F%20linux-focal-cuda12.1-py3.10-gcc9-sm86%20%2F%20test%20(default%2C%202%2C%205%2C%20linux.g5.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22inductor%2Ftest_torchinductor_dynamic_shapes.py%3A%3ATestInductorDynamicCUDA%3A%3Atest_full_cuda%22%5D.
How does this shows this failure test_full_cuda is caused by this PR?

The error msg looks like env issues.

jeffdaily · 2024-02-27T18:05:04Z

My process is to look at the HUD. https://hud.pytorch.org/. My focus is on the ROCm failures since I'm an AMDer, but if I see a chunk of Xs that seem to correspond to a commit that was recently landed I will also look at more than just the ROCm CI jobs to see if it only affected ROCm or additional jobs like the CUDA CI.

The commit in questions is at the time of this writing on page 5 of the HUD.
https://hud.pytorch.org/hud/pytorch/pytorch/main/5?per_page=50

The commit before your PR was landed shows a mostly-green row of CI statuses. But the row that represents your commit shows new failures for both ROCm and CUDA. If you find the commit that represents the revert of your PR by atalman you'll see that some previous CI failures stopped occurring.

I don't dig any deeper than that. You'll need to try and reland your PR and look at the CI signal for your PR to see if you get a green signal.

## Motivation Address the [non-deterministic reduction order](#93542 (comment)) issue for `omp parallel reduction`. ## Latest update on 1.15: 55d8190. Do not reduce to arr in loops. Instead, reduce to a local scaler and write it to arr after local reduction is done. This will allow the compiler to optimize the reduction variable in register instead read/write from memory. If the `working set` of `loop body` is quite large, `read/write from register/memory` will have a large gap. ``` vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0 vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4) ``` Examples code: ``` tmp0_acc_arr[64]; #pragma omp parallel num_threads(64) { auto tid = omp_get_thread_num(); #pragma omp for for(...){ .... tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x; // access array will always from memory } } ``` will be changed to ``` tmp0_acc_arr[64]; #pragma omp parallel num_threads(64) { auto tid = omp_get_thread_num(); **auto tmp0_acc_local = 0;** #pragma omp for for(...){ .... **tmp0_acc_local** = tmp0_acc_local + tmp_x; } **tmp0_acc_arr[tid] = tmp0_acc_local;** } ``` ## Descriptions Following aten to use `two pass reduction` with `omp parallel` for deterministic reduction order. https://github.com/pytorch/pytorch/blob/9c3ae37fc453505f5e437d1edadefdb278c2c39c/aten/src/ATen/Parallel-inl.h#L39 https://github.com/pytorch/pytorch/blob/9c3ae37fc453505f5e437d1edadefdb278c2c39c/aten/src/ATen/native/TensorIteratorReduce.cpp#L24 ``` float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); // init reduction buffer per thread float tmp_acc0_arr[64]; at::vec::Vectorized<float> tmp_acc0_vec_arr[64]; for (int tid = 0; tid < 64; tid++) { tmp_acc0_arr[tid] = 0; tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0); } #pragma omp parallel num_threads(64) { int tid = omp_get_thread_num(); #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0)); auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0)); auto tmp2 = tmp0 - tmp1; auto tmp3 = tmp2 * tmp2; // reduce to per thread buffers tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3; } } // second pass reduce for (int tid = 0; tid < 64; tid++) { tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid]; tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid]; } tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec); out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0); ``` ## Test results I test this PR with dynamo benchmark on 32-core ICX system, Result (avg speed up): | | before this PR | after this PR | | ---- | ---- | ---- | | torchbench | 1.303 | 1.301 | | hugginface | 1.346 | 1.343 | | timms | 1.971 | 1.970 | ``` export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1" export KMP_AFFINITY=granularity=fine,compact,1,0 export KMP_BLOCKTIME=1 multi_threads_test() { CORES=$(lscpu | grep Core | awk '{print $4}') export OMP_NUM_THREADS=$CORES end_core=$(expr $CORES - 1) numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv } SCENARIO=performance DT=float32 export TORCHINDUCTOR_FREEZING=1 Flag_extra="--freezing" Mode_extra="--inference" for suite in timm_models huggingface torchbench do export SUITE=$suite echo $SUITE export LOG_BASE=`date +%m%d%H%M%S` mkdir $LOG_BASE multi_threads_test done ``` System info ``` ubuntuip-172-31-18-205:~/hz/pytorch$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 1 Stepping: 6 BogoMIPS: 5800.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities Virtualization features: Hypervisor vendor: KVM Virtualization type: full Caches (sum of all): L1d: 1.5 MiB (32 instances) L1i: 1 MiB (32 instances) L2: 40 MiB (32 instances) L3: 54 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-63 Vulnerabilities: Gather data sampling: Unknown: Dependent on hypervisor status Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown Retbleed: Not affected Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Srbds: Not affected Tsx async abort: Not affected ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames [ghstack-poisoned]

zhuhaozhe · 2024-03-15T02:00:56Z

Hi, @jeffdaily
For this Job "pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu)"
I tried to reproduce on my local but failed.
Then I re-run it in CI for more than 10 times, and it all passed.

https://github.com/pytorch/pytorch/actions/runs/8199413332/job/22643061963?pr=115620
https://github.com/pytorch/pytorch/actions/runs/8199413332/job/22640053727?pr=115620
...
https://github.com/pytorch/pytorch/actions/runs/8199413332/job/22425358040?pr=115620

So I wish to merge this again. And revert it if this CI failure again.

zhuhaozhe · 2024-03-15T02:01:05Z

@pytorchmergebot merge

pytorchmergebot · 2024-03-15T02:02:57Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

zhuhaozhe · 2024-03-18T01:25:16Z

Looks like there is no failures during past 2 days after re-merge

https://hud.pytorch.org/failure?name=pull%20%2F%20linux-focal-cuda12.1-py3.10-gcc9-sm86%20%2F%20test%20(default%2C%202%2C%205%2C%20linux.g5.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22inductor%2Ftest_torchinductor_dynamic_shapes.py%3A%3ATestInductorDynamicCUDA%3A%3Atest_full_cuda%22%5D

github-actions bot added module: inductor ciflow/inductor labels Dec 12, 2023

zhuhaozhe marked this pull request as draft December 12, 2023 09:41

zhuhaozhe changed the title ~~use two pass reduction for deterministic reduction order~~ [WIP]use two pass reduction for deterministic reduction order Dec 12, 2023

use two pass reduction for deterministic reduction order

6c7f261

[ghstack-poisoned]

zhuhaozhe added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 12, 2023

pytorchbot added the open source label Dec 12, 2023

Update on "[WIP]use two pass reduction for deterministic reduction or…

0dd5c75

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

Update on "[WIP]use two pass reduction for deterministic reduction or…

c3e8cad

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

zhuhaozhe added a commit that referenced this pull request Dec 13, 2023

use two pass reduction for deterministic reduction order

d191f7c

ghstack-source-id: 383973b Pull Request resolved: #115620

Update on "[WIP]use two pass reduction for deterministic reduction or…

cf1be15

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

zhuhaozhe added a commit that referenced this pull request Dec 13, 2023

use two pass reduction for deterministic reduction order

9d1a897

ghstack-source-id: 625ff4d Pull Request resolved: #115620

Update on "[WIP]use two pass reduction for deterministic reduction or…

814d002

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

zhuhaozhe added a commit that referenced this pull request Dec 14, 2023

use two pass reduction for deterministic reduction order

33b41c3

ghstack-source-id: c74f680 Pull Request resolved: #115620

Update on "[WIP]use two pass reduction for deterministic reduction or…

8a34e8c

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

zhuhaozhe added a commit that referenced this pull request Dec 14, 2023

use two pass reduction for deterministic reduction order

a39e053

ghstack-source-id: 409ffaf Pull Request resolved: #115620

Update on "[WIP]use two pass reduction for deterministic reduction or…

5a6cd67

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

Update on "[WIP]use two pass reduction for deterministic reduction or…

8f271bf

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request Dec 15, 2023

use two pass reduction for deterministic reduction order

4bbefd5

ghstack-source-id: 0b92585 Pull Request resolved: #115620

Update on "[WIP]use two pass reduction for deterministic reduction or…

ede3565

…der" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]

zhuhaozhe added the topic: not user facing topic category label Feb 20, 2024

pytorchmergebot added the merging label Feb 20, 2024

pytorchmergebot closed this in cc7ef43 Feb 20, 2024

pytorchmergebot added Merged and removed merging labels Feb 20, 2024

atalman added a commit to atalman/pytorch that referenced this pull request Feb 20, 2024

Revert "use two pass reduction for deterministic reduction order (pyt…

d20c499

…orch#115620)" This reverts commit cc7ef43.

zhuhaozhe reopened this Mar 8, 2024

zhuhaozhe marked this pull request as draft March 8, 2024 05:57

zhuhaozhe mentioned this pull request Mar 20, 2024

[inductor]re-enable cpu reduction ut #122289

Closed

zhuhaozhe mentioned this pull request Apr 22, 2024

Type promotion for accumulate operation differs between eager and CPP dynamo #94017

Closed

This was referenced Aug 22, 2024

[Don't merge] fix dynamic size array(vla) build error on msvc v3 #134210

Closed

[inductor] fix dynamic size array(vla) build error on msvc v4 #134221

Closed

Conversation

zhuhaozhe commented Dec 12, 2023 • edited by pytorch-bot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Latest update on 1.15:

Descriptions

Test results

Uh oh!

pytorch-bot bot commented Dec 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115620

✅ You can merge normally! (2 Unrelated Failures)

Uh oh!

zhuhaozhe commented Dec 15, 2023

Uh oh!

pytorchmergebot commented Dec 15, 2023

Uh oh!

pytorchmergebot commented Dec 15, 2023

Uh oh!

zhuhaozhe commented Dec 16, 2023

Uh oh!

pytorchmergebot commented Dec 16, 2023

Uh oh!

pytorchmergebot commented Dec 16, 2023

Uh oh!

zhuhaozhe commented Feb 20, 2024

Uh oh!

pytorchmergebot commented Feb 20, 2024

Merge started

Uh oh!

jeffdaily commented Feb 20, 2024

Uh oh!

jeffdaily commented Feb 20, 2024

Uh oh!

pytorch-bot bot commented Feb 20, 2024

Uh oh!

jeffdaily commented Feb 20, 2024

Uh oh!

pytorchmergebot commented Feb 20, 2024

Uh oh!

pytorchmergebot commented Feb 20, 2024

Reverting PR 115620 failed

Uh oh!

zhuhaozhe commented Feb 27, 2024

Uh oh!

jeffdaily commented Feb 27, 2024

Uh oh!

zhuhaozhe commented Mar 15, 2024

Uh oh!

zhuhaozhe commented Mar 15, 2024

Uh oh!

pytorchmergebot commented Mar 15, 2024

Merge started

Uh oh!

zhuhaozhe commented Mar 18, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

zhuhaozhe commented Dec 12, 2023 •

edited by pytorch-bot bot

Loading

pytorch-bot bot commented Dec 12, 2023 •

edited

Loading