🐛 Describe the bug
Hi,
When I launched multi-process training (8x A100) using torchvision.datasets.ImageNet() with a freshly prepared root (i.e. containing only ILSVRC2012_devkit_t12.tar.gz, ILSVRC2012_img_train.tar, and ILSVRC2012_img_val.tar), I got errors like this:
$ NUMEXPR_MAX_THREADS=116 $PYTHON $MUPVIT_MAIN /data/ImageNet/ --workers $N_WORKERS --multiprocessing-distributed --batch-size 1024 --log-steps 100
Use GPU: 4 for training
Use GPU: 2 for training
Use GPU: 6 for training
Use GPU: 1 for training
Use GPU: 5 for training
Use GPU: 0 for training
Use GPU: 7 for training
Use GPU: 3 for training
[rank0]:[W1101 02:49:23.260690939 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1101 02:49:24.409000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336321 via signal SIGTERM
W1101 02:49:24.409000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336322 via signal SIGTERM
W1101 02:49:24.411000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336323 via signal SIGTERM
W1101 02:49:24.413000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336324 via signal SIGTERM
W1101 02:49:24.417000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336326 via signal SIGTERM
W1101 02:49:24.418000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336327 via signal SIGTERM
W1101 02:49:24.422000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336328 via signal SIGTERM
Traceback (most recent call last):
  File "/home/ubuntu/Downloads/mup-vit/main.py", line 713, in <module>
    main()
  File "/home/ubuntu/Downloads/mup-vit/main.py", line 189, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args, ))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/ubuntu/Downloads/mup-vit/main.py", line 337, in main_worker
    train_dataset = datasets.ImageNet(args.data, split='train', transform=v2.Compose(transform))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/imagenet.py", line 53, in __init__
    self.parse_archives()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/imagenet.py", line 70, in parse_archives
    parse_train_archive(self.root)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/imagenet.py", line 183, in parse_train_archive
    extract_archive(archive, os.path.splitext(archive)[0], remove_finished=True)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 362, in extract_archive
    suffix, archive_type, compression = _detect_file_type(from_path)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 268, in _detect_file_type
    raise RuntimeError(
RuntimeError: File '/data/ImageNet/train/n02104365' has no suffixes that could be used to detect the archive type and compression.

The other errors were all about certain files already existing or not existing yet. I am fairly sure this is an untar race condition: every spawned worker calls parse_archives() and tries to extract the same tarballs at the same time. I worked around it by deleting all the intermediate files (meta.bin and the train/ and val/ folders) and forcing a single-process training launch to extract and place the files first.
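For reference, here is a minimal sketch of the kind of guard I ended up with (assuming dist.init_process_group() has already run inside each spawned worker; build_train_dataset and its arguments are my own names, not torchvision API):

```python
import torch.distributed as dist
from torchvision import datasets

def build_train_dataset(root, rank, transform):
    # Let rank 0 instantiate the dataset first; its parse_archives() call
    # extracts the tarballs and writes meta.bin exactly once.
    if rank == 0:
        dataset = datasets.ImageNet(root, split='train', transform=transform)
    dist.barrier()  # everyone else waits until extraction has finished
    if rank != 0:
        # meta.bin and train/ now exist, so parse_archives() skips extraction.
        dataset = datasets.ImageNet(root, split='train', transform=transform)
    return dataset
```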
It might be difficult to detect a distributed training launch like this, but can we at least add a warning to the documentation? A tool to extract and place the files beforehand, like python3 -m big_vision.tools.download_tfds_datasets imagenet2012, would also be helpful. This is kind of like the flip side of #2023.
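In the meantime, a one-off script like the sketch below (run once, in a single process, before any distributed launch) does the job, since merely instantiating the dataset triggers the extraction:

```python
from torchvision import datasets

# Run once, single process. parse_archives() extracts the tarballs and
# writes meta.bin; on later runs (and in every worker) it is a no-op.
for split in ('train', 'val'):
    datasets.ImageNet('/data/ImageNet/', split=split)
```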
Versions
The output below shows PyTorch version: 2.3.1 and torchvision==0.18.1, but I am actually running:
$ pip freeze
(...)
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
(...)
I don't know why python3 -mpip list --format=freeze finds the old packages.
Collecting environment information...
PyTorch version: 2.3.1
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 550.90.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 124
On-line CPU(s) list: 0-123
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7542 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 124
Stepping: 0
BogoMIPS: 5800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 7.8 MiB (124 instances)
L1i cache: 7.8 MiB (124 instances)
L2 cache: 62 MiB (124 instances)
L3 cache: 1.9 GiB (124 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-123
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flake8==4.0.1
[pip3] numpy==1.21.5
[pip3] optree==0.12.1
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] triton==2.3.1
[conda] Could not collect