[CI] Set OMP_/OPENBLAS_/MKL_NUM_THREADS=1 in test containers #5625
Numpy/scipy wheels vendor a private OpenBLAS that registers a pthread_atfork handler whose blas_thread_shutdown_ routine joins thread-pool workers in the child of fork(). When Kit's libomni.platforminfo plugin forks during startup, that handler runs and SIGSEGVs because the joined threads do not exist in the child.

The symptom in CI is non-deterministic Process killed by signal 11 SIGSEGV on whatever test happens to be inside Kit startup at fork time -- including trivial utility tests like test_version.py and test_logger.py that have no business segfaulting.

OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 disable the OpenBLAS thread pool entirely, so the atfork handler has nothing to join. MKL_NUM_THREADS=1 covers numpy builds linked against MKL. Setting all three is the documented workaround from numpy#30092 and scipy#23686.

Refs:
- numpy/numpy#30092
- scipy/scipy#23686
- OMPE-92261
Code Review Summary
🔬 Architecture & Design
Assessment: This is a well-researched and appropriately scoped CI fix.
The change adds three environment variables (OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1) to the Docker test container configuration. This is the documented workaround from numpy/numpy#30092 and scipy/scipy#23686 for the OpenBLAS pthread_atfork crash.
Why this works: With thread count set to 1, OpenBLAS doesn't create its worker thread pool, so its atfork handler has nothing to pthread_join when Kit's libomni.platforminfo forks during startup.
File placement: ✅ Correct — .github/actions/run-tests/action.yml is the appropriate location for Docker container environment configuration.
Integration: ✅ The new env vars are placed consistently within the existing docker_env_vars block, alongside related configuration like PYTHONUNBUFFERED and PYTHONIOENCODING.
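One way to confirm that the caps actually reach the test process is a small guard run before any NumPy import. The sketch below is only an illustration: the helper name `missing_thread_caps` is hypothetical and not part of this PR, and it assumes the value "1" is what the container is expected to export.

```python
import os
from typing import Mapping

# The three variables named in this PR. The helper itself is hypothetical.
THREAD_CAP_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def missing_thread_caps(env: Mapping[str, str]) -> list[str]:
    """Return the thread-cap variables that are not set to exactly "1" in env."""
    return [name for name in THREAD_CAP_VARS if env.get(name) != "1"]


if __name__ == "__main__":
    # Inspect the real process environment when run as a script.
    missing = missing_thread_caps(os.environ)
    print("thread caps OK" if not missing else f"not capped: {missing}")
```

Taking the environment as a parameter rather than reading `os.environ` directly keeps the check easy to exercise without touching the real process environment.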
🔍 Implementation Review
| Aspect | Status | Notes |
|---|---|---|
| Correct file | ✅ | .github/actions/run-tests/action.yml |
| Correct syntax | ✅ | Standard -e VAR=VALUE Docker format |
| All three variables | ✅ | OMP, OPENBLAS, and MKL covered |
| No runtime code | ✅ | Pure CI configuration |
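The `-e VAR=VALUE` format noted in the table can be sketched in Python as a helper that assembles the flags for a `docker run` invocation. This is an assumption-laden illustration rather than the contents of action.yml, which are not shown here: the function names and the image name are hypothetical.

```python
import shlex

# Values taken from the PR description. Everything else here is illustrative.
THREAD_CAPS = {
    "OMP_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
}


def docker_env_flags(env: dict[str, str]) -> list[str]:
    """Expand a mapping into repeated ["-e", "NAME=VALUE"] argument pairs."""
    flags: list[str] = []
    for name, value in env.items():
        flags += ["-e", f"{name}={value}"]
    return flags


def docker_run_command(image: str, env: dict[str, str]) -> str:
    """Build a shell-quoted `docker run` command line for a (hypothetical) image."""
    return shlex.join(["docker", "run", *docker_env_flags(env), image])


if __name__ == "__main__":
    print(docker_run_command("example/test-image:latest", THREAD_CAPS))
```

Building the argument list first and quoting it once with `shlex.join` avoids hand-assembled strings, which is where quoting mistakes in CI scripts tend to creep in.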
📋 Cross-Module Impact
None. This change is self-contained within CI infrastructure. The environment variables only affect NumPy/SciPy/OpenBLAS threading behavior inside the test container and have no impact on:
- Isaac Lab runtime code
- User-facing APIs
- Documentation
- Package dependencies
🧪 Test Coverage
No new tests required. This is a CI-only configuration change with no testable code. Validation is empirical as described in the PR:
- Re-trigger CI multiple times
- Compare failure rate to baseline
- Target: return to 6-7% baseline seen 2026-05-09 to 2026-05-11
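The comparison against baseline amounts to simple arithmetic over re-triggered runs. A minimal sketch follows, assuming each run is recorded as a pass/fail boolean and taking the upper end of the 6-7% range as the ceiling. The function names and the sample data in the demo are hypothetical, not CI results.

```python
def failure_rate(outcomes: list[bool]) -> float:
    """Fraction of runs that failed; outcomes[i] is True when run i passed."""
    if not outcomes:
        raise ValueError("need at least one run to compute a rate")
    return outcomes.count(False) / len(outcomes)


def back_at_baseline(rate: float, ceiling: float = 0.07) -> bool:
    """True when the observed failure rate is at or below the baseline ceiling."""
    return rate <= ceiling


if __name__ == "__main__":
    # Made-up sample: 20 re-triggered runs, one of which failed.
    sample = [True] * 19 + [False]
    rate = failure_rate(sample)
    print(f"failure rate {rate:.1%}, back at baseline: {back_at_baseline(rate)}")
```

With only a handful of re-runs the estimate is noisy, so a single comparison like this is a sanity check rather than proof that the rate has returned to baseline.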
✅ Verdict: Ship it
This is a clean, well-documented fix targeting a specific and reproducible CI instability issue. The approach is backed by upstream NumPy/SciPy guidance and the PR description provides excellent context on the problem, the fix mechanism, and how to validate.
Minor observation: If CI stability doesn't improve after this change, the PR description already notes fallback to the dep-manifest dump approach — good contingency planning.
TL;DR
CI is SIGSEGV-ing on ~half of test jobs since the 2026-05-14 driver bump (595.58.03 / CUDA 13.2) because NumPy 2.3.5's bundled OpenBLAS registers a broken pthread_atfork handler that crashes inside Kit's libomni.platforminfo fork(). This PR adds three env vars to the CI test container (OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1), which make OpenBLAS skip the worker thread pool entirely, so the broken atfork handler has nothing to crash on. 3-line change in .github/actions/run-tests/action.yml, no code change. Empirical A/B vs the diagnostic baseline #5626: 0 SIGSEGV on this PR vs 26+ on baseline, same image, same runners. CI is also marginally faster (no SIGSEGV-induced retries). Land this now to unblock CI; the proper upstream fix (NumPy ≥ 2.4.1 in the Isaac Sim base image) is on a separate, slower track.

Proper upstream fix (path to removal)
The root cause is fixed upstream and is propagating through wheel rebuilds:
Status:
Exit criteria for this PR's env-var workaround: revert this PR once the Isaac Sim base image ships NumPy ≥ 2.4.1. Until then, this PR is the cheapest, lowest-risk way to keep CI green. Filing a request with the Isaac Sim team to bump numpy is the parallel track.
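The exit criterion can be expressed as a version comparison. The sketch below is a simplified illustration: the helper names are hypothetical, and it compares only the leading numeric release, so a pre-release such as 2.4.1rc1 is treated the same as 2.4.1.

```python
import re

# Threshold taken from the exit criteria above (NumPy >= 2.4.1).
FIXED_IN = (2, 4, 1)


def release_tuple(version: str) -> tuple[int, ...]:
    """Leading numeric release of a version string, e.g. "2.3.5" -> (2, 3, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def workaround_still_needed(numpy_version: str) -> bool:
    """True while the given NumPy version predates the fixed release."""
    return release_tuple(numpy_version) < FIXED_IN


if __name__ == "__main__":
    for candidate in ("2.3.5", "2.4.1"):
        print(candidate, workaround_still_needed(candidate))
```

In a real check the version string would come from `numpy.__version__` inside the base image. It is passed in as a plain string here so the sketch runs without NumPy installed.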
Final A/B result (CI complete on this PR + companion diagnostic PR #5626)
Both PRs ran on the same pinned Isaac Sim image (sha256:0dd49a11…) on the same CI runner pool. The only difference is the three *_NUM_THREADS=1 env vars in this PR.

Per-job comparison on the highest-SIGSEGV jobs
- isaaclab (core) [3/3]: test_wrench_composer.py, 366/366
- isaaclab (core) [2/3]: test_multi_mesh_ray_caster_camera (not SIGSEGV; 0 vs 17 SIGSEGV on baseline)
- isaaclab_mimic: test_generate_dataset_franka_state.py + test_generate_dataset_gr1t2_pickplace.py CRASHED
- test-curobo: test_pink_ik NaN (0 SIGSEGV)

Confirmed culprit from #5626's dep-manifest dump
The pinned image carries:
libscipy_openblas64_-fdde5778.so, byte-for-byte the .so named in the Slack crash backtrace

Root cause
NumPy 2.3.5 vendors a private OpenBLAS that calls __register_atfork on import. When Kit's libomni.platforminfo calls fork() during init, the registered handler runs in the child and pthread_joins thread-pool workers that no longer exist there → SIGSEGV.

Crash backtrace (Slack thread reply 20):

Which test gets killed is non-deterministic (depends on which test process is in Kit init at fork time); 27 distinct test files have been observed crashing across 8 recent PRs.
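The underlying POSIX behavior, that only the thread calling fork() exists in the child, can be observed from Python. The sketch below is an analogy, not OpenBLAS's handler: it assumes a POSIX system where `os.fork` is available, and the function name is hypothetical.

```python
import os
import threading


def workers_alive_in_child(num_workers: int) -> int:
    """Start idle worker threads, fork, and report how many the child sees alive."""
    stop = threading.Event()
    workers = [
        threading.Thread(target=stop.wait, daemon=True) for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() was duplicated.
        try:
            os.close(read_fd)
            alive = sum(worker.is_alive() for worker in workers)
            os.write(write_fd, str(alive).encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path

    # Parent: read the child's answer, reap it, then stop the workers.
    os.close(write_fd)
    reported = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    for worker in workers:
        worker.join()
    return reported


if __name__ == "__main__":
    print("workers alive in child:", workers_alive_in_child(4))
```

The child sees none of the workers alive, which is why a handler that tries to join them there is operating on threads that are gone. Python 3.12 and later emit a DeprecationWarning when fork() is called from a multi-threaded process, for closely related reasons.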
What this PR does
Adds three env vars to the docker run -e flags in .github/actions/run-tests/action.yml: OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, and MKL_NUM_THREADS=1.

With thread count 1, OpenBLAS doesn't spawn a worker pool, so the atfork handler has nothing to pthread_join → no SIGSEGV. Documented workaround from numpy/numpy#30092 and scipy/scipy#23686. The OpenMP knob (OMP_NUM_THREADS) is load-bearing because NumPy 2.3.5 is built against libgomp; the OpenBLAS and MKL knobs are defense-in-depth for future image rotations.

Performance impact
Measured against PR #5611 baseline durations: most jobs 5–35% faster on #5625, two are 9–11% slower (within CI variance). Net: slightly faster overall. Tests aren't BLAS-compute-bound (they're USD/Kit/sim-bound), so threadcount=1 has no measurable hot-path cost. Fewer SIGSEGV-induced retries also help.
Relationship to other in-flight PRs
- test_noise / test_wrench_composer: test_noise.py / test_wrench_composer.py still SIGSEGV; Kit dies 11s into startup. Helps for the import paths it touches but doesn't catch tests that import numpy at module top.
- PYTHONFAULTHANDLER=1 + -vv, isolate test_sensor_base.py

#5620 and this PR are complementary: #5620 keeps the codebase disciplined, this PR provides the CI-level safety net. Both can land.
Out-of-scope follow-ups surfaced by this work
These are pre-existing issues unrelated to OpenBLAS, also failing on baseline:
- test_pink_ik.py NaN failures in test-curobo: "Left hand IK rotation error (nan)" started after the 5/14 driver bump; affects #5618 (Make Isaac Lab Docker images run as non-root), #5620 (Prevents early numpy imports to avoid Kit crash), #5623 (Refactor train/play and create uv run workflow without dedicated virtual environments), #5625 ([CI] Set OMP_/OPENBLAS_/MKL_NUM_THREADS=1 in test containers), and #5626 ([CI][Diag] Dump numpy/scipy/openblas state pre-pytest). No active PR fixing it. Likely needs CUDA-driver-specific investigation (Octi's #5556, Fix Pink IK DAQP dependency checks, fixed a different pink_ik bug back on 5/09).
- test_multi_mesh_ray_caster_camera::test_output_equal_to_usd_camera_when_intrinsics_set flake: tensor compare returning inf. Passes on #5618, fails on #5625. Same image, same code → flaky test.
- test_pink_ik and isaaclab_tasks shader recompilation overhead: Piotr noted in Slack.

Type of change
Checklist
pre-commit checks with ./isaaclab.sh --format