Long build compilation time (>5 hours) for CUDA ARM build

### 🐛 Describe the bug

Issue summary:

As part of process to add CUDA ARM nightly wheel, we are seeing long build compilation time. Needs ~5hrs to compile https://github.com/pytorch/pytorch/actions/runs/9115630888/job/25074847028?pr=126174#step:16:1320. 

Need to figure out how to decrease the build compilation time. 

Description:

Initially, we are seeing OOM Error while building flash_attn in adding the https://github.com/pytorch/builder/pull/1775/files to nightly CI.

```
2024-04-26T02:20:01.5211732Z [6579/6896] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o�[K
2024-04-26T02:20:01.5215252Z �[31mFAILED: �[0mcaffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o 
2024-04-26T02:20:01.5250703Z /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUSPARSELT -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -DTORCH_ASSERT_NO_OPERATORS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/aten/src/THC -I/pytorch/aten/src/ATen/cuda -I/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/pytorch/build/caffe2/aten/src -I/pytorch/aten/src/ATen/.. -I/pytorch/build/nccl/include -I/pytorch/c10/cuda/../.. -I/pytorch/c10/.. -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -isystem /pytorch/build/third_party/gloo -isystem /pytorch/cmake/../third_party/gloo -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/third_party/gemmlowp -isystem /pytorch/third_party/neon2sse -isystem /pytorch/third_party/XNNPACK/include -isystem /pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /pytorch/third_party/ideep/include -isystem /pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_50,code=sm_50 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-unused-function,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-maybe-uninitialized -Wno-deprecated-copy -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o.d -x cu -c /pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
```
2024-04-26T02:20:01.5283695Z /pytorch/aten/src/ATen/../../../third_party/cutlass/include/cute/layout.hpp(988): catastrophic error: out of memory
Error link - https://github.com/pytorch/pytorch/actions/runs/8840652730/job/24276381274?pr=124112.
Relevant PR for above error - https://github.com/pytorch/pytorch/pull/124112.

Tried set MAX_JOBS=4 (default is 6), no OOM error, but build takes >7 hours
Link - https://github.com/pytorch/pytorch/actions/runs/8970425814/job/24633947792

Now we are going with MAX_JOBS=5. 

Status:

Exploring the option with CMAKE JOB_POOLS for flash-attn only.

cc @malfet @seemethere @ptrblck @msaroufim @snadampal @atalman @Aidyn-A @nWEIdia 



### Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          80
On-line CPU(s) list:             0-79
Vendor ID:                       ARM
Model name:                      Neoverse-N1
Model:                           1
Thread(s) per core:              1
Core(s) per cluster:             80
Socket(s):                       -
Cluster(s):                      1
Stepping:                        r3p1
Frequency boost:                 disabled
CPU max MHz:                     2800.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        50.00
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache:                       5 MiB (80 instances)
L1i cache:                       5 MiB (80 instances)
L2 cache:                        80 MiB (80 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-79
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; CSV2, BHB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.16.0
[pip3] optree==0.11.0
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==3.0.0+989adb9a2
[pip3] torch==2.4.0a0+07cecf4168.nvinternal
[pip3] torch-tensorrt==2.4.0a0
[pip3] torchvision==0.19.0a0
[conda] Could not collect

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long build compilation time (>5 hours) for CUDA ARM build #126980

🐛 Describe the bug

Versions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Long build compilation time (>5 hours) for CUDA ARM build #126980

Description

🐛 Describe the bug

Versions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions