[CI] Set OMP_/OPENBLAS_/MKL_NUM_THREADS=1 in test containers by hujc7 · Pull Request #5625 · isaac-sim/IsaacLab

hujc7 · 2026-05-15T05:31:21Z

TL;DR

CI has been hitting SIGSEGV on roughly half of test jobs since the 2026-05-14 driver bump (595.58.03 / CUDA 13.2) because NumPy 2.3.5's bundled OpenBLAS registers a broken pthread_atfork handler that crashes when Kit's libomni.platforminfo calls fork(). This PR adds three env vars to the CI test container (OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1), which make OpenBLAS skip the worker thread pool entirely, so the broken atfork handler has nothing to crash on. It is a 3-line change in .github/actions/run-tests/action.yml with no source code change. Empirical A/B against the diagnostic baseline #5626: 0 SIGSEGV on this PR vs 26+ on baseline, same image, same runner pool. CI is also marginally faster (no SIGSEGV-induced retries). Land this now to unblock CI; the proper upstream fix (NumPy ≥ 2.4.1 in the Isaac Sim base image) is on a separate, slower track.

Proper upstream fix (path to removal)

The root cause is fixed upstream and is propagating through wheel rebuilds:

OpenBLAS#5520  →  scipy-openblas wheel 0.3.30.7  →  NumPy 2.4.1+

Status:

- OpenBLAS#5520 (atfork deadlock fix): merged.
- NumPy#30132 (bump NumPy to scipy-openblas 0.3.30.7): merged Nov 4, 2025, shipped in NumPy 2.4.0 (yanked), so effectively NumPy ≥ 2.4.1 (released Jan 10, 2026).
- Isaac Sim base image: still pinned to NumPy 2.3.5, per the dep-manifest dump from companion PR #5626 ([CI][Diag] Dump numpy/scipy/openblas state pre-pytest). It needs a numpy bump in the base image to consume the fix.

Exit criteria for this PR's env-var workaround: revert this PR once the Isaac Sim base image ships NumPy ≥ 2.4.1. Until then, this PR is the cheapest, lowest-risk way to keep CI green. Filing a request with the Isaac Sim team to bump numpy is the parallel track.
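The exit criterion above reduces to a version comparison. The following is a minimal sketch, assuming the NumPy version is available as a plain `major.minor.patch` string; `parse_release` and `workaround_still_needed` are hypothetical helper names, not part of this PR or of NumPy.

```python
import re

# First NumPy release that the PR description names as carrying the fix.
FIXED_IN = (2, 4, 1)


def parse_release(version: str) -> tuple[int, int, int]:
    """Extract the leading major.minor[.patch] numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def workaround_still_needed(numpy_version: str) -> bool:
    """True while the installed NumPy predates the release with the fix."""
    return parse_release(numpy_version) < FIXED_IN


if __name__ == "__main__":
    for candidate in ("2.3.5", "2.4.0", "2.4.1"):
        print(candidate, workaround_still_needed(candidate))
```

The sketch ignores pre-release suffixes, so a string such as `2.4.1rc1` is treated like `2.4.1`; a stricter check would need a real version parser.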

Final A/B result (CI complete on this PR + companion diagnostic PR #5626)

| Bucket | PR #5625 (env vars) | PR #5626 (baseline diagnostic) |
| --- | --- | --- |
| Test jobs pass | 18/20 | 14/20 |
| Test jobs fail | 2/20 | 4/20 |
| SIGSEGV occurrences (sampled logs) | 0 | 26+ |

Both PRs ran on the same pinned Isaac Sim image (sha256:0dd49a11…) on the same CI runner pool. The only difference is the three *_NUM_THREADS=1 env vars in this PR.
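One way to reproduce a SIGSEGV tally like the one in the table is to scan job logs for the crash marker. This is only a sketch: the marker is the symptom string quoted in this PR's commit message, while the surrounding log layout, the job names, and both function names are assumptions made for illustration.

```python
# Symptom text quoted in the commit message; the rest of each log line
# is assumed, so matching is done on this substring only.
CRASH_MARKER = "Process killed by signal 11"


def count_crashes(log_text: str) -> int:
    """Count log lines that contain the crash marker."""
    return sum(1 for line in log_text.splitlines() if CRASH_MARKER in line)


def crashes_per_job(logs: dict[str, str]) -> dict[str, int]:
    """Map each job name to the number of crash lines in its log text."""
    return {job: count_crashes(text) for job, text in logs.items()}


if __name__ == "__main__":
    # Hypothetical log excerpt, not taken from a real CI run.
    sample = "test_a.py PASSED\nProcess killed by signal 11 (SIGSEGV)\n"
    print(crashes_per_job({"example-job": sample}))
```

Counting lines rather than distinct tests means a crash that is logged twice is counted twice, which matches "occurrences" rather than "affected files".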

Per-job comparison on the highest-SIGSEGV jobs

| Job | This PR | Baseline (#5626) |
| --- | --- | --- |
| `isaaclab (core) [3/3]` | ✅ PASS (`test_wrench_composer.py` 366/366) | sometimes PASS (non-deterministic baseline) |
| `isaaclab (core) [2/3]` | ❌ fail on flaky `test_multi_mesh_ray_caster_camera` (not SIGSEGV; 0 vs 17 SIGSEGV on baseline) | ❌ 17 SIGSEGV across the run, then job failure |
| `isaaclab_mimic` | ✅ PASS | ❌ 7 SIGSEGV: `test_generate_dataset_franka_state.py` + `test_generate_dataset_gr1t2_pickplace.py` CRASHED |
| `test-curobo` | ❌ fail on pre-existing `test_pink_ik` NaN (0 SIGSEGV) | ❌ same pink_ik NaN + 1 SIGSEGV in earlier test |
| All other 14 test jobs | ✅ PASS | mostly PASS, 1 pip-install infra flake |

Confirmed culprit from #5626's dep-manifest dump

The pinned image carries:

- numpy 2.3.5, scipy 1.17.1 (past the 1.16.2 cliff from scipy/scipy#23686)
- bundled libscipy_openblas64_-fdde5778.so, byte-for-byte the .so named in the Slack crash backtrace
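A check like the one above can be approximated by listing the OpenBLAS shared objects that wheels bundle next to their packages. This sketch assumes the usual Linux wheel layout, where vendored libraries sit in a `<package>.libs/` directory inside site-packages; `find_bundled_openblas` is a hypothetical helper and is not the script used in #5626.

```python
from pathlib import Path


def find_bundled_openblas(site_packages: Path) -> list[Path]:
    """Return bundled scipy-openblas shared objects under site-packages.

    Assumes vendored libraries live in '<package>.libs/' directories,
    e.g. 'numpy.libs/' and 'scipy.libs/'.
    """
    return sorted(site_packages.glob("*.libs/libscipy_openblas*.so*"))


if __name__ == "__main__":
    import sysconfig

    # 'purelib' is the site-packages directory of the running interpreter.
    for lib in find_bundled_openblas(Path(sysconfig.get_paths()["purelib"])):
        print(lib)
```

Comparing the printed file names against the library named in a crash backtrace is a quick way to tell which wheel the crashing OpenBLAS came from.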

Root cause

NumPy 2.3.5 vendors a private OpenBLAS that calls __register_atfork on import. When Kit's libomni.platforminfo calls fork() during init, the registered handler runs in the child and pthread_joins thread-pool workers that no longer exist there → SIGSEGV. Crash backtrace (Slack thread reply 20):

001: pthread_join
002: libscipy_openblas64_-fdde5778.so!blas_thread_shutdown_   ← culprit
003: __register_atfork
004: __libc_fork
005: libomni.platforminfo.plugin.so                            ← Kit init
011: libcarb.so!omniCoreStart

Which test gets killed is non-deterministic (depends on which test process is in Kit init at fork time); 27 distinct test files have been observed crashing across 8 recent PRs.
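The statement that pool workers do not exist in the child of fork() is general POSIX behavior: only the thread that calls fork() is duplicated. The sketch below illustrates this with Python-level threads rather than OpenBLAS's native pool, so it is an analogy and not a reproduction of the crash; `thread_counts_across_fork` is a hypothetical name, and the code is Unix-only.

```python
import os
import threading
import warnings


def thread_counts_across_fork() -> tuple[int, int]:
    """Return (thread count in parent, thread count in child) around fork()."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()  # stands in for a thread-pool worker
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Python 3.12+ warns when a multi-threaded process forks.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread was duplicated.
        try:
            os.close(read_fd)
            os.write(write_fd, str(threading.active_count()).encode())
        finally:
            os._exit(0)

    # Parent: read the child's report, then clean up.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_count = int(pipe.read())
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_count, child_count


if __name__ == "__main__":
    print(thread_counts_across_fork())
```

Any handle to the worker that the child inherits refers to a thread that is not running there, which is why joining it in the child cannot work.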

What this PR does

Adds three env vars to the docker run -e flags in .github/actions/run-tests/action.yml:

-e OMP_NUM_THREADS=1
-e OPENBLAS_NUM_THREADS=1
-e MKL_NUM_THREADS=1

With thread count 1, OpenBLAS doesn't spawn a worker pool, so the atfork handler has nothing to pthread_join → no SIGSEGV. Documented workaround from numpy/numpy#30092 and scipy/scipy#23686. The OpenMP knob (OMP_NUM_THREADS) is load-bearing because NumPy 2.3.5 is built against libgomp; the OpenBLAS and MKL knobs are defense-in-depth for future image rotations.
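As a sketch of how the three settings could be generated and verified from Python, the helpers below build the `-e` arguments and report which caps are absent from an environment mapping. The names `THREAD_CAPS`, `docker_env_flags`, and `missing_thread_caps` are hypothetical; the actual change in this PR is the three literal `-e` lines in action.yml.

```python
import os

# The three variables this PR sets, each pinned to a single thread.
THREAD_CAPS = {
    "OMP_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
}


def docker_env_flags(env: dict[str, str]) -> list[str]:
    """Expand a mapping into 'docker run' style '-e NAME=VALUE' arguments."""
    flags: list[str] = []
    for name, value in env.items():
        flags += ["-e", f"{name}={value}"]
    return flags


def missing_thread_caps(environ) -> list[str]:
    """List the cap variables that are unset or not equal to '1'."""
    return [name for name, value in THREAD_CAPS.items() if environ.get(name) != value]


if __name__ == "__main__":
    print(" ".join(docker_env_flags(THREAD_CAPS)))
    print("missing in this process:", missing_thread_caps(os.environ))
```

A check such as `missing_thread_caps(os.environ)` is only meaningful if it runs before NumPy is imported, on the assumption that the BLAS library reads these variables when it is loaded.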

Performance impact

Measured against PR #5611 baseline durations: most jobs 5–35% faster on #5625, two are 9–11% slower (within CI variance). Net: slightly faster overall. Tests aren't BLAS-compute-bound (they're USD/Kit/sim-bound), so threadcount=1 has no measurable hot-path cost. Fewer SIGSEGV-induced retries also help.

Relationship to other in-flight PRs

| PR | Approach | Status on the SIGSEGV class |
| --- | --- | --- |
| #5620 (pbarejko) | Don't import numpy/torch before AppLauncher; surgically remove AppLauncher from `test_noise` / `test_wrench_composer` | Partial. Even on the latest commit, `test_noise.py` / `test_wrench_composer.py` still SIGSEGV: Kit dies 11s into startup. Helps for the import paths it touches but doesn't catch tests that `import numpy` at module top. |
| #5621 (nvsekkin) | `PYTHONFAULTHANDLER=1` + `-vv`, isolate `test_sensor_base.py` | Pure diagnostic; complementary |
| #5607 (apoddubny) | docker-run retry loop | Different failure mode (transient docker daemon errors); complementary |
| #5625 (this PR) | Make OpenBLAS thread pool empty via env vars | Sufficient. Independent of import order; 0 SIGSEGV across all completed test jobs |

#5620 and this PR are complementary: #5620 keeps the codebase disciplined, this PR provides the CI-level safety net. Both can land.

Out-of-scope follow-ups surfaced by this work

These are pre-existing issues unrelated to OpenBLAS, also failing on baseline:

- test_pink_ik.py NaN failures in test-curobo: "Left hand IK rotation error (nan)" started after the 5/14 driver bump and affects #5618 (Make Isaac Lab Docker images run as non-root), #5620 (Prevents early numpy imports to avoid Kit crash), #5623 (Refactor train/play and create uv run workflow without dedicated virtual environments), #5625 (this PR), and #5626 ([CI][Diag] Dump numpy/scipy/openblas state pre-pytest). No active PR is fixing it. It likely needs CUDA-driver-specific investigation (Octi's #5556, Fix Pink IK DAQP dependency checks, fixed a different pink_ik bug back on 5/09).
- test_multi_mesh_ray_caster_camera::test_output_equal_to_usd_camera_when_intrinsics_set flake: tensor compare returning inf. Passes on #5618, fails on #5625 (this PR). Same image, same code, so the test is flaky.
- test_pink_ik and isaaclab_tasks shader recompilation overhead: noted by Piotr in Slack.

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
CI-only change — no source code, no docs, no changelog fragment needed

Numpy/scipy wheels vendor a private OpenBLAS that registers a pthread_atfork handler whose blas_thread_shutdown_ routine joins thread-pool workers in the child of fork(). When Kit's libomni.platforminfo plugin forks during startup, that handler runs and SIGSEGVs because the joined threads do not exist in the child.

The symptom in CI is non-deterministic Process killed by signal 11 SIGSEGV on whatever test happens to be inside Kit startup at fork time -- including trivial utility tests like test_version.py and test_logger.py that have no business segfaulting.

OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 disable the OpenBLAS thread pool entirely, so the atfork handler has nothing to join. MKL_NUM_THREADS=1 covers numpy builds linked against MKL. Setting all three is the documented workaround from numpy#30092 and scipy#23686.

Refs:
- numpy/numpy#30092
- scipy/scipy#23686
- OMPE-92261

isaaclab-review-bot

Code Review Summary

🔬 Architecture & Design

Assessment: This is a well-researched and appropriately scoped CI fix.

The change adds three environment variables (OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1) to the Docker test container configuration. This is the documented workaround from numpy/numpy#30092 and scipy/scipy#23686 for the OpenBLAS pthread_atfork crash.

Why this works: With thread count set to 1, OpenBLAS doesn't create its worker thread pool, so its atfork handler has nothing to pthread_join when Kit's libomni.platforminfo forks during startup.

File placement: ✅ Correct — .github/actions/run-tests/action.yml is the appropriate location for Docker container environment configuration.

Integration: ✅ The new env vars are placed consistently within the existing docker_env_vars block, alongside related configuration like PYTHONUNBUFFERED and PYTHONIOENCODING.

🔍 Implementation Review

| Aspect | Status | Notes |
| --- | --- | --- |
| Correct file | ✅ | `.github/actions/run-tests/action.yml` |
| Correct syntax | ✅ | Standard `-e VAR=VALUE` Docker format |
| All three variables | ✅ | OMP, OPENBLAS, and MKL covered |
| No runtime code | ✅ | Pure CI configuration |

📋 Cross-Module Impact

None. This change is self-contained within CI infrastructure. The environment variables only affect NumPy/SciPy/OpenBLAS threading behavior inside the test container and have no impact on:

- Isaac Lab runtime code
- User-facing APIs
- Documentation
- Package dependencies

🧪 Test Coverage

No new tests required. This is a CI-only configuration change with no testable code. Validation is empirical as described in the PR:

- Re-trigger CI multiple times
- Compare failure rate to baseline
- Target: return to 6-7% baseline seen 2026-05-09 to 2026-05-11
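The comparison in these steps is simple arithmetic on job counts. The sketch below uses the job counts reported in the PR description's A/B table and takes 7% (the upper end of the stated 6-7% range) as the threshold; `failure_rate` and `meets_target` are hypothetical helper names.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of test jobs that failed in one CI run."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return failed / total


def meets_target(failed: int, total: int, target: float = 0.07) -> bool:
    """True if the failure rate is at or below the target (assumed 7%)."""
    return failure_rate(failed, total) <= target


if __name__ == "__main__":
    # Counts from the A/B table: 2/20 failed on this PR, 4/20 on baseline.
    print(f"this PR:  {failure_rate(2, 20):.0%}")
    print(f"baseline: {failure_rate(4, 20):.0%}")
```

With only 20 jobs per run, one failing job moves the rate by 5 percentage points, which is why the steps above call for re-triggering CI several times before comparing.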

✅ Verdict: Ship it

This is a clean, well-documented fix targeting a specific and reproducible CI instability issue. The approach is backed by upstream NumPy/SciPy guidance and the PR description provides excellent context on the problem, the fix mechanism, and how to validate.

Minor observation: If CI stability doesn't improve after this change, the PR description already notes fallback to the dep-manifest dump approach — good contingency planning.

github-actions Bot added the infrastructure label May 15, 2026

isaaclab-review-bot Bot approved these changes May 15, 2026


hujc7 mentioned this pull request May 15, 2026

[CI][DO NOT MERGE] Test new Isaac Sim image latest-develop sha256:06197a67 #5630

Closed

hujc7 closed this May 15, 2026

hujc7 mentioned this pull request May 15, 2026

[Fix] Pin numpy!=2.3.5 to dodge OpenBLAS atfork SIGSEGV at Kit fork() #5642

Closed

3 tasks

hujc7 wants to merge 1 commit into isaac-sim:develop from hujc7:jichuanh/ci-openblas-env
