Describe the issue:
blas_fpe_check() in numpy/__init__.py (added in #30102, backported to 2.3.5) executes ones((20,20)) @ ones((20,20)) at import time. When numpy is imported from within a dlopen() context — e.g. a C extension's static constructor calling PyImport_ImportModule("numpy") — this causes a deadlock on glibc < 2.34 with MKL BLAS.
Mechanism:
- Python loads a C extension via dlopen(); glibc acquires dl_load_lock
- The extension's static constructor calls PyImport_ImportModule("numpy")
- numpy's __init__.py runs blas_fpe_check() → x @ x → cblas_dgemm
- MKL dispatches dgemm via OpenMP, spawning worker threads
- Worker threads call mkl_serv_load_fun() → dlsym() → tries to acquire dl_load_lock
- Deadlock: main thread holds dl_load_lock and waits at __kmp_join_barrier for workers; workers block on dl_load_lock held by main thread
On glibc ≥ 2.34, dlopen/dlsym use a recursive lock (dl_load_lock2) so this doesn't deadlock. On glibc < 2.34 (RHEL 8, CentOS 8, Amazon Linux 2, etc.), the lock is non-reentrant.
Prior to #30102, no BLAS computation happened at import time, so this was not an issue.
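The lock cycle described above can be reproduced in miniature without MKL or glibc. The following is a minimal sketch, not the actual loader code: a plain `threading.Lock` stands in for dl_load_lock, a Python thread stands in for an OpenMP worker calling dlsym(), and timeouts replace the real indefinite hang so the script terminates. The function name `simulate_import` and the timeout values are made up for illustration.

```python
import threading


def simulate_import(inside_dlopen: bool) -> bool:
    """Return True if the worker finishes before the simulated join barrier gives up."""
    loader_lock = threading.Lock()  # stand-in for glibc's dl_load_lock
    worker_done = threading.Event()

    def worker() -> None:
        # Stand-in for mkl_serv_load_fun() -> dlsym(): it needs the loader lock.
        if loader_lock.acquire(timeout=2.0):
            loader_lock.release()
            worker_done.set()

    if inside_dlopen:
        # dlopen() is still running the static constructor and holds the lock.
        loader_lock.acquire()
    try:
        thread = threading.Thread(target=worker)
        thread.start()
        # Stand-in for __kmp_join_barrier; the real barrier has no timeout.
        finished = worker_done.wait(timeout=0.5)
    finally:
        if inside_dlopen:
            loader_lock.release()
    thread.join()
    return finished


if __name__ == "__main__":
    print("numpy imported inside dlopen():", simulate_import(True))
    print("numpy imported beforehand:", simulate_import(False))
```

With the lock held by the main thread, the worker cannot make progress until the barrier wait is abandoned, so the first call returns False. With no lock held, the worker completes immediately and the second call returns True.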
Reproduce the code example:
# Create environment with MKL-backed numpy and any C extension that
# imports numpy from a static constructor (csp does this):
conda create -n test python=3.11 csp "blas=*=mkl" "numpy=2.3.5"
# repro.py — deadlocks on glibc < 2.34
import importlib.util
import os
import sysconfig
site_packages = sysconfig.get_path("platlib")
so_path = os.path.join(site_packages, "csp", "lib", "_cspimpl.so")
# Loading _cspimpl.so via dlopen triggers its C++ static constructors,
# which call PyImport_ImportModule("numpy"). This worked before 2.3.5.
spec = importlib.util.spec_from_file_location("_cspimpl", so_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod) # hangs forever
print("OK")
**Workaround** — pre-import numpy before loading the extension:
import numpy # ensures blas_fpe_check runs outside dlopen context
import csp # now safe
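If the pre-import should only happen on affected systems, the decision can be keyed on the glibc version. This is a sketch under the assumption that the 2.34 threshold stated in this report is the right cutoff; `needs_numpy_preimport` is a hypothetical helper, while `platform.libc_ver()` is the standard-library call that returns the libc name and version as strings (both may be empty on non-glibc platforms).

```python
import platform


def needs_numpy_preimport(libc_name: str, libc_version: str) -> bool:
    """True for glibc older than 2.34, the threshold stated in this report."""
    if libc_name != "glibc":
        return False  # other libcs and platforms are outside this report's scope
    try:
        major, minor = (int(part) for part in libc_version.split(".")[:2])
    except ValueError:
        return False  # empty or unparseable version string: make no claim
    return (major, minor) < (2, 34)


if __name__ == "__main__":
    name, version = platform.libc_ver()
    print("pre-import numpy first:", needs_numpy_preimport(name, version))
```

Since importing numpy early is harmless on unaffected systems, the unconditional pre-import shown above is the simpler choice; the check is mainly useful for logging or diagnostics.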
Error message:
From GDB
Main thread (waiting for OpenMP workers):
#0 pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 __kmp_suspend_64 () from libiomp5.so
#2 __kmp_wait_template () from libiomp5.so
#3 __kmp_hyper_barrier_gather () from libiomp5.so
#4 __kmp_join_barrier () from libiomp5.so
#5 __kmp_internal_join () from libiomp5.so
#6 __kmp_join_call () from libiomp5.so
#7 __kmpc_fork_call () from libiomp5.so
#8 mkl_blas_dgemm_omp_driver_v1 () from libmkl_intel_thread.so.2
#9 mkl_blas.dgemm () from libmkl_gf_lp64.so.2
#10 cblas_dgemm () from libmkl_gf_lp64.so.2
#11 DOUBLE_matmul_matrixmatrix () from _multiarray_umath.so <-- x @ x
...
#36 import_find_and_load (abs_name='numpy') <-- PyImport_ImportModule
Worker threads (blocked on dl_load_lock):
#0 __lll_lock_wait () from /lib64/libpthread.so.0
#1 pthread_mutex_lock () from /lib64/libpthread.so.0
#2 dlsym () from /lib64/libdl.so.2
#3 mkl_serv_load_fun () from libmkl_core.so.2
#4 mkl_blas_dgemm_mscale () from libmkl_core.so.2
#5 mkl_blas_dgemm_omp_driver_v1.extracted () from libmkl_intel_thread.so.2
Python and NumPy Versions:
- numpy 2.3.5 / 2.4.3 (conda-forge, MKL variant via blas=*=mkl)
- Python 3.11.15
- glibc 2.28 (RHEL 8.10, kernel 4.18.0-553)
- MKL 2025.3.1, LLVM OpenMP 22.1.3
- x86_64
Bisect results:
| numpy version | blas_fpe_check | Deadlocks on glibc 2.28 + MKL? |
|---|---|---|
| 2.2.6 | No | No |
| 2.3.0 – 2.3.4 | No | No |
| 2.3.5 | Yes (backport) | Yes |
| 2.4.0 – 2.4.3 | Yes | Yes |
Runtime Environment:
No response
How does this issue affect you or how did you find it:
Suggested fix:
Guard blas_fpe_check() to only run on platforms where it's needed (ARM/Apple Silicon with Accelerate), or set MKL_NUM_THREADS=1 / OMP_NUM_THREADS=1 for the duration of the check to prevent MKL from spawning worker threads:
def blas_fpe_check():
    with errstate(all='raise'):
        x = ones((20, 20))
        try:
            # Avoid spawning OpenMP threads during import, which deadlocks
            # on glibc < 2.34 if numpy is loaded from within dlopen().
            import os
            old = os.environ.get("MKL_NUM_THREADS")
            os.environ["MKL_NUM_THREADS"] = "1"
            try:
                x @ x
            finally:
                if old is None:
                    os.environ.pop("MKL_NUM_THREADS", None)
                else:
                    os.environ["MKL_NUM_THREADS"] = old
        except FloatingPointError:
            ...
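The save-and-restore of MKL_NUM_THREADS in the snippet above could be factored into a small context manager. This is only a sketch of the environment handling: `temporary_env` is a hypothetical helper, not an existing numpy function. Whether MKL actually honors a change to MKL_NUM_THREADS made at this point, and whether restoring the variable afterwards restores MKL's thread count, is not established in this report and would need to be verified against MKL's behavior.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set os.environ[name] to value inside the with-block, then restore it."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            # The variable was unset before; remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = old


if __name__ == "__main__":
    with temporary_env("MKL_NUM_THREADS", "1"):
        print("inside:", os.environ["MKL_NUM_THREADS"])
    print("after:", os.environ.get("MKL_NUM_THREADS"))
```

Inside blas_fpe_check(), the `x @ x` call would then sit in a `with temporary_env("MKL_NUM_THREADS", "1"):` block, and the same helper would cover OMP_NUM_THREADS if that variable is preferred.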