Basic math operations produce a "floating point exception" #89817
Description
🐛 Describe the bug
When I try to run the following simple piece of code:
import numpy as np
import torch
np.random.seed(42)
x = torch.from_numpy(np.random.rand(100)).float()
print(x)
exp_x = torch.exp(x)
print(exp_x)
I get a floating point exception that kills my Python interpreter:
(venv) [tgebhard@g108] ~ % python test.py
tensor([0.3745, 0.9507, 0.7320, 0.5987, 0.1560, 0.1560, 0.0581, 0.8662, 0.6011,
0.7081, 0.0206, 0.9699, 0.8324, 0.2123, 0.1818, 0.1834, 0.3042, 0.5248,
0.4319, 0.2912, 0.6119, 0.1395, 0.2921, 0.3664, 0.4561, 0.7852, 0.1997,
0.5142, 0.5924, 0.0465, 0.6075, 0.1705, 0.0651, 0.9489, 0.9656, 0.8084,
0.3046, 0.0977, 0.6842, 0.4402, 0.1220, 0.4952, 0.0344, 0.9093, 0.2588,
0.6625, 0.3117, 0.5201, 0.5467, 0.1849, 0.9696, 0.7751, 0.9395, 0.8948,
0.5979, 0.9219, 0.0885, 0.1960, 0.0452, 0.3253, 0.3887, 0.2713, 0.8287,
0.3568, 0.2809, 0.5427, 0.1409, 0.8022, 0.0746, 0.9869, 0.7722, 0.1987,
0.0055, 0.8155, 0.7069, 0.7290, 0.7713, 0.0740, 0.3585, 0.1159, 0.8631,
0.6233, 0.3309, 0.0636, 0.3110, 0.3252, 0.7296, 0.6376, 0.8872, 0.4722,
0.1196, 0.7132, 0.7608, 0.5613, 0.7710, 0.4938, 0.5227, 0.4275, 0.0254,
0.1079])
zsh: floating point exception python test.py
(venv) [tgebhard@g108] ~ %
The problem also occurs for other mathematical operations such as torch.log() or torch.cos(). It seems like it only happens if the size of the input tensor is at least 100, though.
Moreover, the issue only occurs on some machines, and only under specific circumstances: my local machine runs the code above without any problem, whereas one of the machines at work reproducibly gives the error above, but only if I request at least 14 CPU cores (it is a batch queue system based on HTCondor). It might therefore be the case that only this particular machine has a problem. Any pointers for debugging this are greatly appreciated! 🙂
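One way to narrow this down without losing the interpreter each time is to run the reproduction in a child process and inspect its exit status. The sketch below is only a starting point and rests on assumptions that are not confirmed in this report: that the crash depends on the number of intra-op threads, and that this PyTorch build honours the `OMP_NUM_THREADS` environment variable. The names `REPRO`, `run_snippet`, `describe` and `sweep` are hypothetical helpers made up for illustration.

```python
import os
import signal
import subprocess
import sys

# The reproduction from this report, parameterised by tensor size (argv[1]).
REPRO = """
import sys
import numpy as np
import torch
np.random.seed(42)
n = int(sys.argv[1])
x = torch.from_numpy(np.random.rand(n)).float()
torch.exp(x)
"""


def run_snippet(code, num_threads, *args):
    """Run `code` in a fresh interpreter and return its exit status.

    OMP_NUM_THREADS is set for the child only; whether the PyTorch build
    in use respects it is an assumption.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    proc = subprocess.run(
        [sys.executable, "-c", code, *map(str, args)],
        env=env,
        capture_output=True,
    )
    return proc.returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


def sweep(thread_counts, sizes):
    """Print the outcome of the reproduction for each (threads, size) pair."""
    for threads in thread_counts:
        for size in sizes:
            status = describe(run_snippet(REPRO, threads, size))
            print(f"threads={threads:3d} size={size:4d}: {status}")
```

Calling `sweep(range(1, 17), [50, 99, 100, 200])` on the affected machine would show whether the failure tracks the thread count, the tensor size, or both. A line reading `killed by SIGFPE` corresponds to the "floating point exception" message printed by the shell.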
Versions
Information about the Python environment:
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No devices found.
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==1.8.3.post0
[pip3] torch==1.13.0
[pip3] torchmetrics==0.10.3
[conda] Could not collect
Information about the machine where the problem occurs (output of lscpu):
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0,1,9-16,26-31
Off-line CPU(s) list: 2-8,17-25,32-255
Thread(s) per core: 0
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7662 64-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 1499.941
CPU max MHz: 2000.0000
CPU min MHz: 1500.0000
BogoMIPS: 3999.98
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
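For reference, the on-line CPU list in the `lscpu` output above can be expanded to see how many of the 256 CPUs the job actually has. This is a minimal sketch: `expand_cpu_list` and `visible_cpus` are hypothetical helpers, and whether a mismatch between the total and the usable CPU count is related to the crash is an open question, not something established here.

```python
import os


def expand_cpu_list(spec):
    """Expand a Linux CPU list such as "0,1,9-16" into a list of CPU ids."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def visible_cpus():
    """Return (total CPUs, CPUs this process may run on or None)."""
    total = os.cpu_count()
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        allowed = len(os.sched_getaffinity(0))
    else:
        allowed = None
    return total, allowed


# "On-line CPU(s) list" value copied from the lscpu output in this report.
online = expand_cpu_list("0,1,9-16,26-31")
print(len(online))  # 16
```

Running `visible_cpus()` inside the batch job and on the local machine would show whether Python sees the same discrepancy between total and usable CPUs that `lscpu` reports.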
cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10