Updates installation logic for pip_prebundle#5279
Conversation
- Add comprehensive unit tests for `_repoint_prebundle_packages`, `_ensure_cuda_torch`, and `_maybe_uninstall_torch_if_exts_deprecated` in a new test_install_commands.py (42 tests covering all `_PREBUNDLE_REPOINT_PACKAGES` combinations, conda/venv/kit Python environments, arm vs x86, and the nvidia-cudnn guard) - Extend test_install.py with three new `TestExtractPythonExe` tests covering the default venv, kit Python, and system Python fallback paths - Add `pull_request` trigger to install-ci.yml so the installation CI runs automatically when install.py, utils.py, or test files change; add `|| 'ubuntu:24.04'` fallback for `BASE_IMAGE` so PR-triggered runs (where `inputs` is unpopulated) get the correct default image
Greptile SummaryThis PR fixes a bug in
Confidence Score: 5/5Safe to merge; the bug fix logic is correct and the only findings are a missing changelog entry and absent tests claimed in the PR description. No P0 or P1 issues found. The nvidia-namespace guard (checking cudnn) is logically sound and correctly scoped. The P2 findings are a missing Fixed changelog entry for the primary fix, a version bump that should accompany it, and test/CI additions described in the PR body that are absent from the actual diff. These are process gaps, not code defects. source/isaaclab/docs/CHANGELOG.rst and source/isaaclab/config/extension.toml — need a new patch version entry for the nvidia repoint bug fix. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[command_install] --> B[_install_system_deps]
A --> C[Filter PYTHONPATH\nremove pip_prebundle paths]
C --> D[Build probe_env\nwith original PYTHONPATH]
D --> E[_maybe_uninstall_torch_if_\nextsDeprecated_prebundle_first_on_path]
E --> F{torch under\nextsDeprecated?}
F -- Yes --> G[pip uninstall torch/torchvision/torchaudio]
F -- No --> H[Continue]
G --> H
H --> I[_ensure_cuda_torch]
I --> J{torch version\nalready correct?}
J -- Yes --> K[Skip install]
J -- No --> L[pip install torch+cuXXX]
K --> M[_install_isaaclab_submodules]
L --> M
M --> N[_install_extra_frameworks]
N --> O[_ensure_cuda_torch again]
O --> P[_repoint_prebundle_packages]
P --> Q{pkg == nvidia?}
Q -- Yes --> R{venv_pkg/cudnn\nexists?}
R -- No --> S[SKIP: would strip CUDA libs - BUG FIX]
R -- Yes --> T[Repoint nvidia normal flow]
Q -- No --> T
T --> U{prebundled is symlink?}
U -- Yes, same target --> V[Leave untouched - idempotent]
U -- Yes, different --> W[unlink + re-symlink]
U -- No --> X[backup .bak then symlink]
Reviews (1): Last reviewed commit: "Updates installation logic for pip_prebu..." | Re-trigger Greptile |
| 4.6.5 (2026-04-14) | ||
| ~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Changed | ||
| ^^^^^^^ | ||
|
|
||
| * During :func:`~isaaclab.cli.commands.install.command_install`, detect when the first | ||
| ``torch`` on ``sys.path`` would come from an ``extsDeprecated`` extension prebundle | ||
| (for example ``omni.isaac.ml_archive``) and run ``pip uninstall`` on ``torch``, | ||
| ``torchvision``, and ``torchaudio`` before installing pinned CUDA wheels. | ||
|
|
There was a problem hiding this comment.
Missing
Fixed changelog entry for nvidia-namespace repoint bug
The 4.6.5 entry documents only the extsDeprecated torch-detection change under Changed, but the primary bug fix in this PR — skipping the nvidia repoint when cudnn is absent — has no entry. Per project guidelines, every change targeting the source directory requires a corresponding changelog entry. A new patch version (e.g. 4.6.6) should be created with a Fixed section describing this fix, and source/isaaclab/config/extension.toml should be bumped to match.
| if pkg_name == "nvidia" and not (venv_pkg / "cudnn").exists(): | ||
| print_debug(f"Skipping repoint of {prebundled}: {venv_pkg} lacks CUDA subpackages (cudnn missing).") | ||
| continue |
There was a problem hiding this comment.
Consider also handling other CUDA-critical nvidia subpackages
The guard uses cudnn as the sole indicator of a CUDA-capable nvidia namespace. If a future environment installs CUDA libraries under a different subpackage name (e.g. cublas, curand) without cudnn, the check would incorrectly allow the repoint. A more defensive approach would be to check for any known CUDA subpackage rather than cudnn alone:
_CUDA_NVIDIA_SUBPKGS = {"cudnn", "cublas", "curand", "cufft"}
if pkg_name == "nvidia" and not any((venv_pkg / sub).exists() for sub in _CUDA_NVIDIA_SUBPKGS):
print_debug(f"Skipping repoint of {prebundled}: {venv_pkg} lacks CUDA subpackages.")
continueThis is a minor hardening suggestion; the cudnn check is a reasonable minimal indicator for the immediate bug.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes a critical ImportError: libcudnn.so.9 caused by _repoint_prebundle_packages() incorrectly replacing the entire nvidia namespace package in pip_prebundle with a symlink to site-packages/nvidia/ (which only contains srl). The fix correctly skips repointing the nvidia namespace when the target lacks CUDA subpackages. Alongside this, the PR migrates deprecated Isaac Sim extension imports to isaacsim.core.experimental.*, fixes Docker _isaac_sim symlink handling, adds a CUDA synchronize for tiled camera GPU race conditions, and retires several test scripts that depended on deprecated extensions.
Design Assessment
Design is sound. The core approach — checking for cudnn as a sentinel before repointing the nvidia namespace — is a pragmatic and correct fix. The nvidia namespace package is uniquely problematic because it spans dozens of distributions (nvidia-cudnn-cu12, nvidia-cublas-cu12, nvidia-srl, etc.), so blanket symlink replacement is inherently unsafe when the target only has a subset. The cudnn check is the right minimal indicator.
The Isaac Sim API migration (isaacsim.core.utils.* → isaacsim.core.experimental.utils.*) is straightforward and necessary as Isaac Sim 6.0 deprecates the old paths. The Docker symlink fix correctly addresses the .dockerignore glob bug (_isaac_sim? matches one-extra-char, not _isaac_sim itself).
The new _maybe_uninstall_torch_if_exts_deprecated_prebundle_first_on_path function is a reasonable defensive measure — detecting when sys.path ordering would load torch from a deprecated prebundle and proactively uninstalling before reinstalling the correct CUDA version.
Findings
🟡 Warning: CUDA synchronize per data type inside the render loop may regress performance — source/isaaclab_physx/isaaclab_physx/renderers/isaac_rtx_renderer.py:317
The torch.cuda.synchronize() call is placed inside the for data_type in ... loop, meaning it synchronizes once per annotator type (rgb, depth, normals, etc.). For cameras with multiple data types, this serializes the GPU pipeline multiple times per frame. While this fixes the cudaErrorIllegalAddress race condition, it could measurably impact training throughput in environments with multiple camera outputs.
Consider moving the synchronize call outside the data-type loop (after all warp kernels have been launched) so there's only one sync point per camera per frame. Alternatively, using torch.cuda.current_stream().synchronize() or a CUDA event-based fence would be more targeted.
🟡 Warning: check_pva_sensor.py — API signature change for ViewportManager.set_camera_view — source/isaaclab_physx/test/sensors/check_pva_sensor.py:122
The old API was set_camera_view(eye, target) (positional). The new API is ViewportManager.set_camera_view(camera_prim_path, eye=..., target=...) with the camera path as the first positional argument. This is correct for the new experimental API, but the isaacsim_kit_perspective_video.py call at line 37 passes arguments as (camera_prim_path, eye=..., target=...) while check_pva_sensor.py passes as ("/OmniverseKit_Persp", eye=[...], target=[...]). Both look correct, just noting the API requires the camera prim path as the first arg now.
🟡 Warning: xr_anchor_manager.py — broad except Exception when creating XR anchor prim — source/isaaclab_teleop/isaaclab_teleop/xr_anchor_manager.py:87
The XformPrim constructor call is wrapped in except Exception as e with only a logger.warning. While this was pre-existing, the API migration changes the constructor signature (now takes positions and orientations as batch arrays instead of position/orientation). If the new API rejects the arguments silently (e.g., wrong shape), the warning may not provide enough context for debugging. The numpy reshape to (1, 3) and (1, 4) looks correct for the experimental API's batched interface.
🔵 Suggestion: _torch_first_on_sys_path_is_exts_deprecated could log stderr on unexpected failures — source/isaaclab/isaaclab/cli/commands/install.py:80
The probe subprocess runs with check=False and only inspects returncode == 1. If the probe script fails for an unrelated reason (e.g., Python crashes, syntax error), returncode could be non-zero but not 1, causing the function to silently return False (no deprecated torch found). Consider logging result.stderr when returncode not in (0, 1) to surface unexpected probe failures.
🔵 Suggestion: .dockerignore fix is correct but comment could be more precise — .dockerignore:25
The comment now says "ignore isaac sim symlink (host absolute targets must not replace image ln -sf)" which is accurate. Minor: the original _isaac_sim? was a glob matching _isaac_sim + exactly one character — it never matched _isaac_sim at all. The fix is spot-on.
Test Coverage
The PR description claims "42 new unit tests in test_install_commands.py" but this file does not appear in the diff or in the repository at this commit. The only test file in source/isaaclab/test/cli/ is test_install.py. This is the most notable gap:
- The core bug fix (nvidia repoint skip when cudnn missing) has no visible regression test in this PR. A test that mocks a
pip_prebundle/nvidiadirectory withoutcudnnand verifies the repoint is skipped would be valuable. - The torch uninstall from extsDeprecated logic similarly has no visible test.
- The CUDA synchronize fix in
isaac_rtx_renderer.pyis a runtime GPU concern that's difficult to unit-test but should be covered by integration tests. - The API migration changes (import path updates) are implicitly tested by existing test suites that import these modules.
- The Docker fixes are verified by the Docker build CI jobs (currently pending).
If the test file was intended for a separate PR or was accidentally omitted from this commit, that should be clarified.
CI Status
- ✅ pre-commit: pass
- ✅ Check for Broken Links: pass
- ✅ labeler: pass
- ⏳ license-check: pending
- ⏳ Docker builds: pending
- ⏳ Docs builds: pending
Verdict
Minor fixes needed
The core bug fix is correct and well-reasoned. The API migration is mechanical and looks accurate. The Docker symlink fix addresses a real build issue. Two items worth addressing:
- Performance consideration: The per-data-type
torch.cuda.synchronize()in the tiled camera renderer could be moved outside the loop to avoid multiple sync points per frame. - Test gap: The PR description references
test_install_commands.pywith 42 tests, but this file isn't in the diff. Clarifying whether these tests are in a follow-up PR or were accidentally omitted would help confirm the regression test coverage for the core fix.
| if dev.type == "cuda": | ||
| cuda_idx = dev.index if dev.index is not None else torch.cuda.current_device() | ||
| with torch.cuda.device(cuda_idx): | ||
| torch.cuda.synchronize() |
There was a problem hiding this comment.
🟡 Warning: Performance — synchronize per data type
This torch.cuda.synchronize() is inside the for data_type in ... loop. For cameras with multiple data types (e.g., rgb + depth + normals), this serializes the GPU pipeline multiple times per frame.
Consider moving the sync outside the data-type loop — a single sync after all warp kernels are launched would still prevent the cudaErrorIllegalAddress race while avoiding repeated stalls.
| [python_exe, "-c", probe], | ||
| env=env, | ||
| check=False, | ||
| capture_output=True, |
There was a problem hiding this comment.
🔵 Suggestion: Log unexpected probe failures
When returncode is neither 0 nor 1 (e.g., the subprocess crashes), the function silently returns False. Consider logging result.stderr for unexpected exit codes to aid debugging:
if result.returncode not in (0, 1) and result.stderr:
print_debug(f"extsDeprecated probe returned {result.returncode}: {result.stderr.strip()}")The test_settings.py changes in the previous commit (consolidating CUROBO_PLANNER_TESTS/SKILLGEN_TESTS, removing QUARANTINED_TESTS, renaming dataset test files, adjusting timeouts) were not related to the prebundle/install fix and are reverted here.
The torch.cuda.synchronize() guard added in the previous commit was not related to the prebundle/install fix and is reverted here.
Restores the CartpoleCameraEnvCfg TYPE_CHECKING import and the semantic_segmentation branch that were inadvertently removed in the prebundle commit.
Remove all isaacsim.core.* -> isaacsim.core.experimental.* migration changes, doc updates, and unrelated changelog/version bumps from this branch. These will land in a dedicated fix-isaacsim-deprecation-api PR.
- Add back pip install of nvidia-srl-usd-to-urdf and nvidia/srl symlink into extsDeprecated/omni.isaac.ml_archive/pip_prebundle in both Dockerfile.base and Dockerfile.curobo; fixes ModuleNotFoundError for nvidia.srl in Docker images - Restore torch.cuda.synchronize() after each tiled camera copy in IsaacRtxRenderer; fixes intermittent cudaErrorIllegalAddress when Replicator/Warp GPU writes race subsequent Torch ops
Instead of synchronizing after each wp.launch (one sync per data type), split the loop into two passes: launch all kernels, synchronize once, then do all Torch post-processing. Reduces N syncs to 1.
torch.cuda.synchronize() only waits on PyTorch's CUDA streams. Warp kernels run on their own stream, so the previous sync was a no-op for the actual reshape kernels causing cudaErrorIllegalAddress. wp.synchronize_device() properly drains the Warp stream.
… add tests - Rename _torch_first_on_sys_path_is_exts_deprecated → _torch_first_on_sys_path_is_prebundle: check for 'pip_prebundle' in the normalized path instead of 'extsDeprecated'. This catches the prebundle regardless of whether omni.isaac.ml_archive lives under exts/, extsDeprecated/, or any other search path. - Rename _maybe_uninstall_torch_if_exts_deprecated_prebundle_first_on_path → _maybe_uninstall_prebundled_torch for clarity. - Update all references in test_install_commands.py. - Add test_install_prebundle.py with _split_install_items tests and a probe script content verification test.
c0776b1 to
cc7cf77
Compare
Add omni.isaac.ml_archive and isaacsim.pip.newton to app.extensions.excluded in all kit files. These extensions prebundle torch, warp, newton and mujoco-warp that shadow Isaac Lab's pip-installed versions, causing CUDA runtime errors and warp kernel lookup failures. Keep extsDeprecated in extension search paths since other deprecated extensions may still be needed.
a422c5a to
f3570cf
Compare
Isaac Sim's setup_python_env.sh injects pip_prebundle directories onto PYTHONPATH before Python starts. These contain older copies of torch, warp, nvidia-cudnn etc. that shadow Isaac Lab's pip-installed versions, causing CUDA runtime errors and warp kernel lookup failures. Add _filter_prebundle_paths() to isaaclab/__init__.py that runs on import and strips paths from omni.isaac.ml_archive, omni.isaac.core_archive, omni.kit.pip_archive, and isaacsim.pip.newton prebundle dirs. Also cleans PYTHONPATH env var so child processes stay clean. Non-conflicting prebundle paths (lula, cumotion, urdf, compute, cloud) are preserved.
The previous commit filtered core_archive, pip_archive, and pip.newton prebundle dirs too. Those contain packages (matplotlib, Jinja2, Pillow, certifi, etc.) that the runtime needs — removing them broke 8 CI jobs. Only ml_archive prebundles the conflicting torch + nvidia CUDA libs. Narrow the filter to just that extension.
Removing ml_archive/pip_prebundle from sys.path also hides the nvidia shared libraries (libcudart, libcudnn, etc.) that torch loads via ctypes.CDLL at import time, causing: OSError: libcudart.so.12: cannot open shared object file After filtering sys.path, glob the removed prebundle dirs for nvidia/*/lib directories and append them to LD_LIBRARY_PATH so the .so files remain discoverable.
Removing ml_archive/pip_prebundle from sys.path entirely broke two things: 1. libcudart.so.12 not found (nvidia shared libs live in the prebundle) 2. sympy not found (only exists in the prebundle, needed by torch) Instead of removing the path, move it to the END of sys.path so that pip-installed packages (torch, warp, etc.) in site-packages take priority, while prebundle-only packages (sympy, nvidia libs) remain accessible as fallbacks.
Isaac Lab uses its own pip-installed warp-lang package. Enabling omni.warp.core loads Isaac Sim's bundled warp which can conflict with the pip-installed version.
Ubuntu's docs server returns 403 to automated requests (bot detection). This is a known issue with lychee link checking. Adding to the exclude list alongside other sites that block automated crawlers.
…broken links Combined fixes from PR isaac-sim#5279 on top of deprecation migration from PR isaac-sim#5280: 1. install.py: Add prebundle probe functions that check pip_prebundle paths instead of extsDeprecated, and uninstall pip torch when prebundle shadows it 2. isaaclab/__init__.py: Deprioritize ml_archive/pip_prebundle on sys.path so pip-installed torch/numpy/etc take priority over prebundled copies 3. setup.py: Pin mujoco==3.5.0 4. Kit files: Remove omni.warp.core to prevent conflict with pip warp-lang 5. check-links.yml: Add ubuntu.com to link checker exclusions (returns 403) 6. New tests: test_install_commands.py, test_install_prebundle.py
…broken links Combined fixes from PR isaac-sim#5279 on top of deprecation migration from PR isaac-sim#5280: 1. install.py: Add prebundle probe functions that check pip_prebundle paths instead of extsDeprecated, and uninstall pip torch when prebundle shadows it 2. isaaclab/__init__.py: Deprioritize ml_archive/pip_prebundle on sys.path so pip-installed torch/numpy/etc take priority over prebundled copies 3. setup.py: Pin mujoco==3.5.0 4. Kit files: Remove omni.warp.core to prevent conflict with pip warp-lang 5. check-links.yml: Add ubuntu.com to link checker exclusions (returns 403) 6. New tests: test_install_commands.py, test_install_prebundle.py
|
Closing this temporarily |
…broken links Combined fixes from PR isaac-sim#5279 on top of deprecation migration from PR isaac-sim#5280: 1. install.py: Add prebundle probe functions that check pip_prebundle paths instead of extsDeprecated, and uninstall pip torch when prebundle shadows it 2. isaaclab/__init__.py: Deprioritize ml_archive/pip_prebundle on sys.path so pip-installed torch/numpy/etc take priority over prebundled copies 3. setup.py: Pin mujoco==3.5.0 4. Kit files: Remove omni.warp.core to prevent conflict with pip warp-lang 5. check-links.yml: Add ubuntu.com to link checker exclusions (returns 403) 6. New tests: test_install_commands.py, test_install_prebundle.py
…broken links Combined fixes from PR isaac-sim#5279 on top of deprecation migration from PR isaac-sim#5280: 1. install.py: Add prebundle probe functions that check pip_prebundle paths instead of extsDeprecated, and uninstall pip torch when prebundle shadows it 2. isaaclab/__init__.py: Deprioritize ml_archive/pip_prebundle on sys.path so pip-installed torch/numpy/etc take priority over prebundled copies 3. setup.py: Pin mujoco==3.5.0 4. Kit files: Remove omni.warp.core to prevent conflict with pip warp-lang 5. check-links.yml: Add ubuntu.com to link checker exclusions (returns 403) 6. New tests: test_install_commands.py, test_install_prebundle.py
Description
Bug fix:
_repoint_prebundle_packages()was replacing the nvidia namespace package inpip_prebundlewith a symlink to Isaac Sim'ssite-packages/nvidia/(which only contains srl). This stripped away the CUDA shared libraries (libcudnn.so.9, etc.) that torch needs, causingImportError: libcudnn.so.9: cannot open shared object file. Fixed by skipping the nvidia repoint when the target site-packages does not contain the cudnn subpackage.Tests: Added 42 new unit tests in test_install_commands.py covering
_repoint_prebundle_packages,_ensure_cuda_torch, and_maybe_uninstall_torch_if_exts_deprecatedacross all supported installation paths (conda, venv, kit Python, system Python; x86 and ARM). Extended test_install.py with three new TestExtractPythonExe cases.CI: Added a pull_request path trigger to install-ci.yml so installation tests run automatically when the relevant files change. Added || 'ubuntu:24.04' fallback for BASE_IMAGE so PR-triggered runs get the correct default image.
Type of change
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there