Basic math operations produce a "floating point exception"

### 🐛 Describe the bug

When I try to run the following simple piece of code:

```python
import numpy as np
import torch

np.random.seed(42)

x = torch.from_numpy(np.random.rand(100)).float()
print(x)

exp_x = torch.exp(x)
print(exp_x)
```

I get a `floating point exception` that kills my Python interpreter:
```
(venv) [tgebhard@g108] ~ % python test.py
tensor([0.3745, 0.9507, 0.7320, 0.5987, 0.1560, 0.1560, 0.0581, 0.8662, 0.6011,
        0.7081, 0.0206, 0.9699, 0.8324, 0.2123, 0.1818, 0.1834, 0.3042, 0.5248,
        0.4319, 0.2912, 0.6119, 0.1395, 0.2921, 0.3664, 0.4561, 0.7852, 0.1997,
        0.5142, 0.5924, 0.0465, 0.6075, 0.1705, 0.0651, 0.9489, 0.9656, 0.8084,
        0.3046, 0.0977, 0.6842, 0.4402, 0.1220, 0.4952, 0.0344, 0.9093, 0.2588,
        0.6625, 0.3117, 0.5201, 0.5467, 0.1849, 0.9696, 0.7751, 0.9395, 0.8948,
        0.5979, 0.9219, 0.0885, 0.1960, 0.0452, 0.3253, 0.3887, 0.2713, 0.8287,
        0.3568, 0.2809, 0.5427, 0.1409, 0.8022, 0.0746, 0.9869, 0.7722, 0.1987,
        0.0055, 0.8155, 0.7069, 0.7290, 0.7713, 0.0740, 0.3585, 0.1159, 0.8631,
        0.6233, 0.3309, 0.0636, 0.3110, 0.3252, 0.7296, 0.6376, 0.8872, 0.4722,
        0.1196, 0.7132, 0.7608, 0.5613, 0.7710, 0.4938, 0.5227, 0.4275, 0.0254,
        0.1079])
zsh: floating point exception  python test.py
(venv) [tgebhard@g108] ~ %
```

The problem also occurs for other mathematical operations such as `torch.log()` or `torch.cos()`. However, it only seems to happen if the input tensor has at least 100 elements.
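Because the crash kills the interpreter, the size threshold is easiest to probe from a parent process that runs each attempt in a child and inspects its exit status. The following is a minimal sketch of that idea, not a confirmed diagnosis: `SNIPPET`, `classify`, and `probe` are hypothetical names introduced here, and the snippet assumes that `numpy` and `torch` are importable in the same environment.

```python
import signal
import subprocess
import sys

# Hypothetical one-liner mirroring the repro above; the tensor size is passed as argv[1].
SNIPPET = (
    "import sys, numpy as np, torch; "
    "torch.exp(torch.from_numpy(np.random.rand(int(sys.argv[1]))).float())"
)


def classify(returncode):
    """Turn a child's return code into 'ok', a signal name, or 'exit N'."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def probe(size, snippet=SNIPPET):
    """Run the snippet in a fresh interpreter so a crash cannot kill the caller."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet, str(size)],
        capture_output=True,
    )
    return classify(proc.returncode)
```

Looping over a few sizes, for example `for n in (10, 50, 99, 100, 1000): print(n, probe(n))`, should show where the result switches from `ok` to `SIGFPE` on an affected machine.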

**Moreover, the issue only occurs on _some_ machines, and only under specific circumstances:** My local machine runs the code above without any problem. One of the machines at work reproducibly gives the error above, but _only_ if I request at least 14 CPU cores (jobs there are submitted through a batch queue system based on HTCondor). It might, therefore, be the case that only this particular machine has a problem. Any pointers for debugging this are greatly appreciated! 🙂
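Since the crash depends on the number of requested cores, one thing worth checking is what the process actually sees in terms of CPUs and threads. On Linux, `SIGFPE` is also raised for integer division by zero, so a thread count or chunk size computed from an unusual CPU layout is one possible (unconfirmed) suspect. The sketch below is only a starting point under that assumption: `cpu_report` and `run_repro` are hypothetical helper names, while `faulthandler.enable()`, `torch.set_num_threads()`, and `torch.get_num_threads()` are existing APIs.

```python
import faulthandler
import os


def cpu_report():
    """Collect what the OS and the environment say about the available CPUs."""
    report = {"os.cpu_count": os.cpu_count()}
    # sched_getaffinity is Linux-only; it reflects limits set by the batch system.
    if hasattr(os, "sched_getaffinity"):
        report["affinity"] = len(os.sched_getaffinity(0))
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        report[var] = os.environ.get(var)
    return report


def run_repro(num_threads=None, size=100):
    """Run the repro, optionally pinning the number of intra-op threads."""
    import numpy as np
    import torch

    # Dump a Python traceback to stderr if the process receives SIGFPE.
    faulthandler.enable()
    if num_threads is not None:
        torch.set_num_threads(num_threads)
    print("CPU report:", cpu_report())
    print("torch threads:", torch.get_num_threads())

    np.random.seed(42)
    x = torch.from_numpy(np.random.rand(size)).float()
    return torch.exp(x)
```

If `run_repro(num_threads=1)` succeeds inside the 14-core job while `run_repro()` crashes, that would point toward the threading setup rather than the math kernels themselves.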

### Versions

Information about the Python environment:
```
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun  2 2021, 10:49:15)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No devices found.
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==1.8.3.post0
[pip3] torch==1.13.0
[pip3] torchmetrics==0.10.3
[conda] Could not collect
```

Information about the machine where the problem occurs (output of `lscpu`):
```
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0,1,9-16,26-31
Off-line CPU(s) list:            2-8,17-25,32-255
Thread(s) per core:              0
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7662 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         1499.941
CPU max MHz:                     2000.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        3999.98
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-63,128-191
NUMA node1 CPU(s):               64-127,192-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
```

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
