
[CI] Set OMP_/OPENBLAS_/MKL_NUM_THREADS=1 in test containers#5625

Closed
hujc7 wants to merge 1 commit into isaac-sim:develop from hujc7:jichuanh/ci-openblas-env


@hujc7 hujc7 commented May 15, 2026

Collaborator

TL;DR

CI has been SIGSEGV-ing on ~half of test jobs since the 2026-05-14 driver bump (595.58.03 / CUDA 13.2) because NumPy 2.3.5's bundled OpenBLAS registers a broken pthread_atfork handler that crashes when Kit's libomni.platforminfo calls fork().

This PR adds three env vars to the CI test container (OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1), which make OpenBLAS skip the worker thread pool entirely, so the broken atfork handler has nothing to crash on. It is a 3-line change in .github/actions/run-tests/action.yml with no code change.

Empirical A/B vs the diagnostic baseline #5626: 0 SIGSEGV on this PR vs 26+ on baseline, same image, same runners. CI is also marginally faster (no SIGSEGV-induced retries).

Land this now to unblock CI; the proper upstream fix (NumPy ≥ 2.4.1 in the Isaac Sim base image) is on a separate, slower track.

Proper upstream fix (path to removal)

The root cause is fixed upstream and is propagating through wheel rebuilds:

OpenBLAS#5520  →  scipy-openblas wheel 0.3.30.7  →  NumPy 2.4.1+

Status:

  • OpenBLAS#5520 (atfork deadlock fix): merged
  • NumPy#30132 (bump NumPy to scipy-openblas 0.3.30.7): merged Nov 4, 2025, shipped in NumPy 2.4.0 (yanked) → effectively NumPy ≥ 2.4.1 (released Jan 10, 2026)
  • Isaac Sim base image: still pinned to NumPy 2.3.5 (per the dep-manifest dump from companion PR #5626, "[CI][Diag] Dump numpy/scipy/openblas state pre-pytest"). Needs a NumPy bump in the base image to consume the fix.

Exit criteria for this PR's env-var workaround: revert this PR once the Isaac Sim base image ships NumPy ≥ 2.4.1. Until then, this PR is the cheapest, lowest-risk way to keep CI green. Filing a request with the Isaac Sim team to bump numpy is the parallel track.
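The exit criterion can be expressed as a small version check. The following is a minimal sketch, not part of this PR: the helper names `parse_release` and `workaround_still_needed` are hypothetical, and the parser deliberately ignores pre-release suffixes, so a string such as `2.4.1rc1` is treated as `2.4.1`.

```python
import re

# First non-yanked NumPy release carrying the fix, per the status list above.
FIXED_NUMPY = (2, 4, 1)


def parse_release(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a version string; a missing patch counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


def workaround_still_needed(numpy_version: str) -> bool:
    """True while the image's NumPy predates the release that carries the fix."""
    return parse_release(numpy_version) < FIXED_NUMPY
```

Inside the container, one way to use it would be `workaround_still_needed(numpy.__version__)`; once that returns False for the pinned image, the env vars are a candidate for removal.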

Final A/B result (CI complete on this PR + companion diagnostic PR #5626)

| Bucket | PR #5625 (env vars) | PR #5626 (baseline diagnostic) |
| --- | --- | --- |
| Test jobs pass | 18/20 | 14/20 |
| Test jobs fail | 2/20 | 4/20 |
| SIGSEGV occurrences (sampled logs) | 0 | 26+ |

Both PRs ran on the same pinned Isaac Sim image (sha256:0dd49a11…) on the same CI runner pool. The only difference is the three *_NUM_THREADS=1 env vars in this PR.
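For anyone repeating the comparison, SIGSEGV occurrences can be tallied from downloaded job logs with a few lines of Python. This is a rough sketch, not the method used to produce the numbers above: the marker pattern is an assumption based on the "Process killed by signal 11 SIGSEGV" message quoted in the commit message, and `count_sigsegv_lines` is a hypothetical helper.

```python
import re

# Assumed marker: matches "signal 11" as a whole number, or the literal "SIGSEGV".
SIGSEGV_MARKER = re.compile(r"signal 11\b|SIGSEGV")


def count_sigsegv_lines(log_text: str) -> int:
    """Count log lines that mention a SIGSEGV; each line is counted at most once."""
    return sum(1 for line in log_text.splitlines() if SIGSEGV_MARKER.search(line))
```

Counting lines rather than raw matches avoids double-counting a line that contains both "signal 11" and "SIGSEGV".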

Per-job comparison on the highest-SIGSEGV jobs

| Job | This PR | Baseline (#5626) |
| --- | --- | --- |
| isaaclab (core) [3/3] | ✅ PASS (test_wrench_composer.py 366/366) | sometimes PASS (non-deterministic baseline) |
| isaaclab (core) [2/3] | ❌ fail on flaky test_multi_mesh_ray_caster_camera (not SIGSEGV; 0 vs 17 SIGSEGV on baseline) | ❌ 17 SIGSEGV across the run, then job failure |
| isaaclab_mimic | ✅ PASS | ❌ 7 SIGSEGV: test_generate_dataset_franka_state.py + test_generate_dataset_gr1t2_pickplace.py CRASHED |
| test-curobo | ❌ fail on pre-existing test_pink_ik NaN (0 SIGSEGV) | ❌ same pink_ik NaN + 1 SIGSEGV in earlier test |
| All other 14 test jobs | ✅ PASS | mostly PASS, 1 pip-install infra flake |

Confirmed culprit from #5626's dep-manifest dump

The pinned image carries:

  • numpy 2.3.5, scipy 1.17.1 (past the 1.16.2 cliff from scipy/scipy#23686)
  • bundled libscipy_openblas64_-fdde5778.so, byte-for-byte the .so named in the Slack crash backtrace

Root cause

NumPy 2.3.5 vendors a private OpenBLAS that calls __register_atfork on import. When Kit's libomni.platforminfo calls fork() during init, the registered handler runs in the child and pthread_joins thread-pool workers that no longer exist there → SIGSEGV. Crash backtrace (Slack thread reply 20):

001: pthread_join
002: libscipy_openblas64_-fdde5778.so!blas_thread_shutdown_   ← culprit
003: __register_atfork
004: __libc_fork
005: libomni.platforminfo.plugin.so                            ← Kit init
011: libcarb.so!omniCoreStart

Which test gets killed is non-deterministic (depends on which test process is in Kit init at fork time); 27 distinct test files have been observed crashing across 8 recent PRs.
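The POSIX behavior behind the crash (only the thread that calls fork() exists in the child) can be observed from pure Python. The sketch below is an illustration only: it does not touch OpenBLAS or Kit, `thread_counts_across_fork` is a hypothetical name, and it requires a platform that provides `os.fork` (Linux or macOS).

```python
import os
import threading
import warnings


def thread_counts_across_fork() -> tuple[int, int]:
    """Return (Python threads in the parent before fork, Python threads seen by the child)."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Newer Python versions warn about forking a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the worker thread was not copied, only the forking thread exists.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: read the child's report, then clean up the child and the worker.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_count, child_count
```

The parent reports at least two threads while the child reports one, which is why an atfork handler that joins pool workers in the child is operating on threads that are not there.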

What this PR does

Adds three env vars to the docker run -e flags in .github/actions/run-tests/action.yml:

-e OMP_NUM_THREADS=1
-e OPENBLAS_NUM_THREADS=1
-e MKL_NUM_THREADS=1

With thread count 1, OpenBLAS doesn't spawn a worker pool, so the atfork handler has nothing to pthread_join → no SIGSEGV. Documented workaround from numpy/numpy#30092 and scipy/scipy#23686. The OpenMP knob (OMP_NUM_THREADS) is load-bearing because NumPy 2.3.5 is built against libgomp; the OpenBLAS and MKL knobs are defense-in-depth for future image rotations.
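To confirm the caps actually reach the test process, a pre-flight check along these lines could run before NumPy is imported. It is a minimal sketch and not part of this PR; `THREAD_CAP_VARS` and `missing_thread_caps` are hypothetical names, and where the check would live (for example a conftest.py) is an assumption.

```python
import os
from collections.abc import Mapping

# The three variables this PR passes to `docker run -e`.
THREAD_CAP_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def missing_thread_caps(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the thread-cap variables that are unset or set to something other than "1"."""
    return [name for name in THREAD_CAP_VARS if env.get(name) != "1"]
```

An empty list means all three caps are in effect. The check only matters if it runs before the first `import numpy`, since these variables are typically read when the BLAS library is loaded.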

Performance impact

Measured against PR #5611 baseline durations: most jobs 5–35% faster on #5625, two are 9–11% slower (within CI variance). Net: slightly faster overall. Tests aren't BLAS-compute-bound (they're USD/Kit/sim-bound), so threadcount=1 has no measurable hot-path cost. Fewer SIGSEGV-induced retries also help.

Relationship to other in-flight PRs

| PR | Approach | Status on the SIGSEGV class |
| --- | --- | --- |
| #5620 (pbarejko) | Don't import numpy/torch before AppLauncher; surgically remove AppLauncher from test_noise / test_wrench_composer | Partial. Even on the latest commit, test_noise.py / test_wrench_composer.py still SIGSEGV (Kit dies 11s into startup). Helps for the import paths it touches but doesn't catch tests that import numpy at module top. |
| #5621 (nvsekkin) | PYTHONFAULTHANDLER=1 + -vv, isolate test_sensor_base.py | Pure diagnostic; complementary |
| #5607 (apoddubny) | docker-run retry loop | Different failure mode (transient docker daemon errors); complementary |
| #5625 (this PR) | Make OpenBLAS thread pool empty via env vars | Sufficient. Independent of import order; 0 SIGSEGV across all completed test jobs |

#5620 and this PR are complementary: #5620 keeps the codebase disciplined, this PR provides the CI-level safety net. Both can land.
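The import-order discipline that #5620 aims for can be spot-checked at runtime by inspecting `sys.modules` just before the simulator is launched. The sketch below is an illustration under assumptions, not code from either PR: `prematurely_imported` is a hypothetical helper, and the point at which it would be called is left to the test author.

```python
import sys
from collections.abc import Iterable, Mapping


def prematurely_imported(
    names: Iterable[str] = ("numpy", "torch"),
    modules: Mapping[str, object] = sys.modules,
) -> list[str]:
    """Return which of the given modules are already imported in this process."""
    return [name for name in names if name in modules]
```

A non-empty result before launch means some module-level import has already pulled in NumPy or torch, which is the case the env vars in this PR cover regardless of import order.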

Out-of-scope follow-ups surfaced by this work

These are pre-existing issues unrelated to OpenBLAS, also failing on baseline:

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

Numpy/scipy wheels vendor a private OpenBLAS that registers a
pthread_atfork handler whose blas_thread_shutdown_ routine joins
thread-pool workers in the child of fork().  When Kit's
libomni.platforminfo plugin forks during startup, that handler runs
and SIGSEGVs because the joined threads do not exist in the child.

The symptom in CI is non-deterministic Process killed by signal 11
SIGSEGV on whatever test happens to be inside Kit startup at fork
time -- including trivial utility tests like test_version.py and
test_logger.py that have no business segfaulting.

OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 disable the OpenBLAS
thread pool entirely, so the atfork handler has nothing to join.
MKL_NUM_THREADS=1 covers numpy builds linked against MKL.  Setting
all three is the documented workaround from numpy#30092 and
scipy#23686.

Refs:
- numpy/numpy#30092
- scipy/scipy#23686
- OMPE-92261

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Code Review Summary

🔬 Architecture & Design

Assessment: This is a well-researched and appropriately scoped CI fix.

The change adds three environment variables (OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1) to the Docker test container configuration. This is the documented workaround from numpy/numpy#30092 and scipy/scipy#23686 for the OpenBLAS pthread_atfork crash.

Why this works: With thread count set to 1, OpenBLAS doesn't create its worker thread pool, so its atfork handler has nothing to pthread_join when Kit's libomni.platforminfo forks during startup.

File placement: ✅ Correct — .github/actions/run-tests/action.yml is the appropriate location for Docker container environment configuration.

Integration: ✅ The new env vars are placed consistently within the existing docker_env_vars block, alongside related configuration like PYTHONUNBUFFERED and PYTHONIOENCODING.

🔍 Implementation Review

| Aspect | Status | Notes |
| --- | --- | --- |
| Correct file | | .github/actions/run-tests/action.yml |
| Correct syntax | | Standard -e VAR=VALUE Docker format |
| All three variables | | OMP, OPENBLAS, and MKL covered |
| No runtime code | | Pure CI configuration |

📋 Cross-Module Impact

None. This change is self-contained within CI infrastructure. The environment variables only affect NumPy/SciPy/OpenBLAS threading behavior inside the test container and have no impact on:

  • Isaac Lab runtime code
  • User-facing APIs
  • Documentation
  • Package dependencies

🧪 Test Coverage

No new tests required. This is a CI-only configuration change with no testable code. Validation is empirical as described in the PR:

  1. Re-trigger CI multiple times
  2. Compare failure rate to baseline
  3. Target: return to 6-7% baseline seen 2026-05-09 to 2026-05-11

✅ Verdict: Ship it

This is a clean, well-documented fix targeting a specific and reproducible CI instability issue. The approach is backed by upstream NumPy/SciPy guidance and the PR description provides excellent context on the problem, the fix mechanism, and how to validate.

Minor observation: If CI stability doesn't improve after this change, the PR description already notes fallback to the dep-manifest dump approach — good contingency planning.
