[CI] Add per-job uv venv isolation and upgrade CI version to CUDA 13 #23119
Merged
Conversation
Phase 0 scaffolding for the uv venv migration: runners can opt into a fresh per-job venv via SGLANG_CI_USE_VENV=1, which eliminates the stale CUDA .so accumulation in the runner's writable layer across toolkit bumps (e.g. cu129 -> cu130 -> cu129 revert). The toggle is off by default; the legacy path is behaviorally unchanged.

In venv mode, the install script now:
- auto-detects CU_VERSION from nvcc,
- validates that the host driver is >= the container toolkit,
- guards against unsupported CUDA versions,
- discovers the nvidia/torch .so directories for LD_LIBRARY_PATH, and
- runs a smoke test that asserts the loaded NVIDIA libs resolve under $VIRTUAL_ENV (catching runtime shadowing that plain ldd misses).

ci_install_deepep.sh now sources ci_install_dependency.sh so venv activation propagates, and replaces nvidia-smi CUDA detection with the inherited $NVCC_VER. All bare `pip` calls are converted to $PIP_CMD. Also adds ci_cleanup_venv.sh (best-effort post-job cleanup) and a canary workflow that forces the venv path on 1-gpu-5090; install/sanity steps fail loudly while the test run is continue-on-error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
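For illustration, the venv-containment smoke test could look roughly like this (a minimal sketch, not the PR's actual script; the /proc/self/maps scan and the `libcudart`/`/nvidia/` path filters are assumptions):

```bash
# Hypothetical sketch: import torch (which dlopens the CUDA runtime libs),
# then assert that every NVIDIA .so actually mapped into the process resolves
# under $VIRTUAL_ENV. This catches live shadowing that ldd misses, since ldd
# only predicts resolution and never observes the running process.
python - <<'EOF'
import os
import torch  # importing a CUDA build of torch loads the NVIDIA libs

venv = os.environ["VIRTUAL_ENV"]
offenders = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        fields = line.split()
        path = fields[-1] if len(fields) >= 6 else ""  # pathname column, if any
        if ("libcudart" in path or "/nvidia/" in path) and not path.startswith(venv):
            offenders.add(path)
assert not offenders, f"NVIDIA libs loaded from outside the venv: {sorted(offenders)}"
print("smoke test OK: all NVIDIA libs resolve under", venv)
EOF
```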
If a CI job is cancelled mid-run, the per-job uv venv persists in /tmp. Add a sweep of /tmp/sglang-ci-* dirs older than 4 hours to ci_cleanup_venv.sh, complementing the per-job targeted removal.
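A sweep along those lines (a sketch; the 4-hour threshold comes from the comment above, the glob from the per-job venv naming):

```bash
# Best-effort sweep for venvs orphaned by cancelled jobs: remove any
# /tmp/sglang-ci-* directory not modified in the last 4 hours (240 min).
# Errors are swallowed so the cleanup step can never fail the job.
find /tmp -maxdepth 1 -type d -name 'sglang-ci-*' -mmin +240 \
  -exec rm -rf {} + 2>/dev/null || true
```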
This reverts commit e8e2c1e.
This reverts commit 3d4d7e0.
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator Author
/tag-and-rerun-ci
This reverts commit e978aac.
…to kangyan/ci-uv-venv-migration
alisonshao added a commit that referenced this pull request on Apr 19, 2026
Re-enables per-job uv venv (disabled in #23119) by pairing it with two changes that make deep_gemm's bf16 JIT cache work across the per-job venv:
- UV_VENV="/tmp/sglang-ci-venv" - a stable path, so library_root (and the resulting cache-key hash) is identical across every job and container.
- export DG_JIT_CACHE_DIR=/root/.cache/deep_gemm/bf16_jit_cache - redirects deep_gemm's bf16 cache out of /root/.deep_gemm/ (the container's writable layer) into the already-host-mounted deep_gemm subdir, so all containers on a host share compiled kernels.

Both are independently required: neither alone makes the cross-container cache hit work. Verified on an H200 host with two separate containers: first compile 2.0 s, cross-container read 0.010 s (~220x speedup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
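Sketched as shell, the two settings described above reduce to (values taken from the commit message; the surrounding script context is assumed):

```bash
# Stable venv path: deep_gemm hashes the package's library_root (an absolute
# path inside the venv) into its cache key, so the path must be identical
# across jobs and containers for cache hits to be possible.
UV_VENV="/tmp/sglang-ci-venv"

# Redirect deep_gemm's bf16 JIT cache from /root/.deep_gemm/ (the container's
# writable layer) to the host-mounted cache dir, so all containers on a host
# share compiled kernels.
export DG_JIT_CACHE_DIR=/root/.cache/deep_gemm/bf16_jit_cache
```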
alisonshao added a commit that referenced this pull request on Apr 19, 2026
Re-enables per-job uv venv (disabled in #23119) by using a stable venv path so deep_gemm's NVCC file cache remains reusable across jobs.
- USE_VENV default flipped to 1 (install script + pr-test.yml env).
- UV_VENV="/tmp/sglang-ci-venv" - stable across every job and container on a host. deep_gemm hashes library_root (the abspath of the deep_gemm package) into its cache key, so varying the venv path per job breaks cache reuse; holding the path constant keeps the hash constant.
- rm -rf before `uv venv` handles a stale dir left by a crashed prior job.
- The cleanup script's fallback targets the stable path (a glob is no longer needed).

deep_gemm's cache dir itself is already host-mounted on every runner (/root/.cache/deep_gemm/), so no additional mount or env-var change is needed: the stable venv path alone restores cross-container cache sharing. Verified on an H200 host: compile ~2 s in one container, cross-container read ~0.01 s in a second container at the same stable path.
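The creation flow this implies might look like the following (a minimal sketch under the assumptions above; the activation line is illustrative):

```bash
UV_VENV="/tmp/sglang-ci-venv"     # constant path -> constant deep_gemm cache key
rm -rf "$UV_VENV"                 # clear a stale dir left by a crashed prior job
uv venv "$UV_VENV"                # fresh per-job environment at the stable path
source "$UV_VENV/bin/activate"    # later $PIP_CMD calls now target this venv
```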
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request on Apr 19, 2026
Port PR sgl-project#23136 (Yuhao Yang): cudaMemcpyBatchAsync lost its failIdx parameter in CUDA 13, so the dlsym-based call was passing the stream handle in the wrong slot and segfaulting inside cuMemcpyBatchAsync_v2. Use driver_version at runtime to dispatch to either the CUDA 12 or the CUDA 13 signature.

With the segfault fixed, move the 7 hicache tests that were parked under test/manual in PR sgl-project#23119 and subsequent cu13 flake sweeps back into test/registered so they run in CI again:
- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

The TODO "move back after fixed" docstrings are stripped, and the register_cuda_ci call that was dropped from the mooncake backend test on its way to manual is restored.

Co-Authored-By: Yuhao Yang <yhyang201@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request on Apr 19, 2026
After the main `uv pip install -e python[...]` step, runners that carried state from the pre-sgl-project#23119 (cu129) era keep `nvidia-cuda-runtime-cu12` installed as an orphan (Required-by: empty) alongside the cu13 runtime. Its libcudart.so.12 sits under `nvidia/cuda_runtime/lib/` while cu13's lives under `nvidia/cu13/lib/`. Both dirs end up on LD_LIBRARY_PATH, so cudnn_frontend_shim.h's probe

    for lib in ["libcudart.so.12", "libcudart.so.13"]: dlopen(lib)

loads both and throws:

    RuntimeError: Multiple libcudart libraries found: libcudart.so.12 and libcudart.so.13

Tests hit this during server setUpClass -> CUDA graph capture (e.g. test_nvfp4_gemm_sm120.py on stage-b-test-1-gpu-small). The same failure reproduces on main, so this is not PR-specific: it's a leftover cleanup step the cu13 migration missed.

Fix: uninstall nvidia-cuda-runtime-cu12 right after the main install. Its install dir is disjoint from cu13's, so the uninstall doesn't touch any files shared with cu13 packages (a blunter sweep of all `nvidia-*-cu12` breaks torch because several pairs share dirs under `nvidia/<name>/lib/`, and uninstalling one deletes files that the cu13 variant still references through its RECORD).

Reproduced and verified on 5090-novita-ci-runner-d (runner-1 container):
before: libcudart.so.12 + libcudart.so.13 both loadable
after: only libcudart.so.13 loadable, torch.cuda.randn works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
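The fix reduces to something like this (a sketch; `$PIP_CMD` per the earlier conversion, and the verification one-liner assumes LD_LIBRARY_PATH is already set up):

```bash
# Drop only the orphaned cu12 runtime; a blanket `nvidia-*-cu12` sweep would
# delete files that cu13 packages still reference through their RECORDs.
$PIP_CMD uninstall -y nvidia-cuda-runtime-cu12 || true

# Verify only the cu13 runtime remains loadable (assumes LD_LIBRARY_PATH is set).
python -c "import ctypes; ctypes.CDLL('libcudart.so.13'); print('cu13 runtime OK')"
```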
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request on Apr 20, 2026
`modelopt_quant` and `modelopt_export_path` were removed from ModelConfig.__init__ in sgl-project#10154 (replaced by unified `quantization` flag and LoadConfig.modelopt_export_path), but the test was never updated. It stayed latent because the class is skipped when nvidia-modelopt isn't installed; sgl-project#23119 added the dep to the CI image yesterday, which exposed the failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request on Apr 22, 2026
Since sgl-project#23119 flipped the `sgl-kernel-build-wheels` matrix to `cuda-version: "13.0"` only (in preparation for the CI torch upgrade), any PR touching sgl-kernel is silently reverted on H-series (SM90) runners whose test-env torch is still cu129. The failure mode is invisible on the surface:

1. The PR's `sgl-kernel-build-wheels` produces a cu130 wheel (artifact `wheel-python3.10-cuda13.0`).
2. H-series test jobs download that wheel into `sgl-kernel/dist/` and `ci_install_dependency.sh` installs it.
3. The script's "sgl-kernel +cuXYZ ≠ CU_VERSION" guard (correct in its intent: a cu130 wheel is genuinely ABI-incompatible with cu129 torch) then reinstalls `sglang-kernel==<ver>` from the public Artifactory index, replacing the PR's built wheel with the main-branch wheel the runner is compatible with.
4. Any sgl-kernel change in the PR (new kernel signatures, schema tweaks, etc.) is silently dropped. Python-side editable code keeps the PR's expectations -> `TypeError: unexpected keyword argument` at first call.

Example: PR sgl-project#21985 adds `out=` to `flash_attn_with_kvcache`. The Python wrapper (editable) passes `out=`, but the reinstalled main wheel's C++ op doesn't accept it -> TypeError on `stage-c-test-8-gpu-h20`.

Fix:
1. Restore `cuda-version: "12.9"` as a second matrix entry in both the x86_64 and aarch64 `sgl-kernel-build-wheels` jobs, so every PR produces BOTH cu129 and cu130 wheels.
2. Change all test-job `download-artifact` patterns from `wheel-python3.10-cuda13.0` to `wheel-python3.10-cuda*` so both wheels land in `sgl-kernel/dist/` (`merge-multiple: true` is already set).
3. In `ci_install_dependency.sh`, select the wheel matching `$CU_VERSION` by name (`+${CU_VERSION}`), falling back to the previous "any matching wheel" glob if a single-CUDA wheel is all that's present, as sketched below; this preserves pre-sgl-project#23119 behavior for branches that haven't picked up this change.

After this patch:
- B200 (cu130) tests install the cu130 wheel, no reinstall.
- H-series (cu129) tests install the cu129 wheel, no reinstall.
- The public-index fallback only fires when the PR didn't build its own wheel (e.g. `/rerun-stage` without a kernel rebuild), matching its original purpose.

Cost: one extra matrix job per PR that touches sgl-kernel (~10 min on x86_64, ~10 min on aarch64). The net per-PR CI-runtime change is positive for sgl-kernel PRs (no more silently-passing tests that were really running main's wheel) and zero for other PRs (the matrix doesn't run when there's nothing to build).
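Step 3 of the fix could be sketched as follows (illustrative only; the actual script's variable names may differ):

```bash
# Prefer the wheel whose local version tag matches the runner's CUDA, e.g. a
# *+cu129* wheel on an H-series (cu129) runner.
WHEEL=$(ls sgl-kernel/dist/*+"${CU_VERSION}"*.whl 2>/dev/null | head -n1)
if [ -z "$WHEEL" ]; then
  # Fallback: pre-#23119 behavior when only a single-CUDA wheel was built.
  WHEEL=$(ls sgl-kernel/dist/*.whl 2>/dev/null | head -n1)
fi
if [ -n "$WHEEL" ]; then
  $PIP_CMD install "$WHEEL"
fi
```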
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026
…gl-project#23119)
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>
Kangyan-Zhou added a commit that referenced this pull request on Apr 26, 2026
The 7 hicache tests below were moved from test/registered to test/manual in PR #23119 (cu13 upgrade) and follow-up flake sweeps because they hit the cudaMemcpyBatchAsync segfault on CUDA 13. That segfault is fixed in sglang-kernel 0.4.1.post1 (this PR), so move the tests back into test/registered:
- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

The TODO "move back after fixed" docstrings are stripped, and the register_cuda_ci call dropped from the mooncake backend test on its way to manual is restored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request on Apr 27, 2026
…gl-project#23119)
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>