
[CI] Add per-job uv venv isolation and upgrade CI version to CUDA 13 #23119

Merged
Fridge003 merged 105 commits into main from kangyan/ci-uv-venv-migration on Apr 19, 2026

Conversation

@Fridge003 (Collaborator)

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Kangyan-Zhou and others added 15 commits April 16, 2026 15:07
Phase 0 scaffolding for the uv venv migration: runners can opt into a fresh
per-job venv via SGLANG_CI_USE_VENV=1, which eliminates the stale CUDA .so
accumulation in the runner's writable layer across toolkit bumps (e.g.
cu129 -> cu130 -> cu129 revert). Toggle is off by default; legacy path is
behaviorally unchanged.

The install script now auto-detects CU_VERSION from nvcc in venv mode,
validates host-driver >= container toolkit, guards against unsupported
CUDA versions, discovers nvidia/torch .so directories for LD_LIBRARY_PATH,
and runs a smoke test that asserts loaded NVIDIA libs resolve under
$VIRTUAL_ENV (catching runtime shadowing that plain ldd misses).
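
A minimal sketch of such a smoke test, assuming a Linux runner with torch installed into the venv (the real script's names and exclusions may differ):

    #!/usr/bin/env bash
    # Hypothetical smoke test: import torch inside the venv, then assert every
    # NVIDIA .so mapped into the process resolves under $VIRTUAL_ENV. Unlike a
    # static ldd pass, this inspects what the dynamic loader actually mapped,
    # so it catches runtime shadowing by stale container-layer copies.
    set -euo pipefail
    "$VIRTUAL_ENV/bin/python" - <<'PY'
    import os
    import torch  # importing torch loads libcudart/cuDNN/NCCL at runtime

    venv = os.path.realpath(os.environ["VIRTUAL_ENV"])
    leaked = []
    with open(f"/proc/{os.getpid()}/maps") as maps:
        for line in maps:
            parts = line.split()
            path = parts[-1] if len(parts) >= 6 else ""
            base = os.path.basename(path)
            if base.startswith("libcuda.so"):
                continue  # the driver library legitimately comes from the host
            if base.startswith("libcu") and ".so" in base:
                if not os.path.realpath(path).startswith(venv):
                    leaked.append(path)
    if leaked:
        raise SystemExit(f"FAIL: NVIDIA libs loaded outside venv: {sorted(set(leaked))}")
    print("OK: all NVIDIA libraries resolve under", venv)
    PY
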

ci_install_deepep.sh now sources ci_install_dependency.sh so venv
activation propagates, and replaces nvidia-smi CUDA detection with the
inherited $NVCC_VER. All bare `pip` calls converted to $PIP_CMD.

Adds ci_cleanup_venv.sh (best-effort post-job cleanup) and a canary
workflow that forces the venv path on 1-gpu-5090; install/sanity fail
loudly while the test run is continue-on-error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

If a CI job is cancelled mid-run, the per-job uv venv persists in /tmp.
Add a sweep of /tmp/sglang-ci-* dirs older than 4 hours to
ci_cleanup_venv.sh, complementing the per-job targeted removal.
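
In the spirit of that sweep, a one-liner sketch (-mmin +240 means older than 4 hours; the actual script may differ):

    # Best-effort: remove per-job venv dirs that a cancelled job left behind;
    # anything younger than 4 hours is assumed to belong to a live job and kept.
    find /tmp -maxdepth 1 -type d -name 'sglang-ci-*' -mmin +240 -exec rm -rf {} + || true
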
@github-actions github-actions bot added the `dependencies` (Pull requests that update a dependency file) and `sgl-kernel` labels on Apr 18, 2026
@Fridge003 (Collaborator, Author)

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit 6ecd6f8 into main Apr 19, 2026
13 of 24 checks passed
@Fridge003 Fridge003 deleted the kangyan/ci-uv-venv-migration branch April 19, 2026 12:32
alisonshao added a commit that referenced this pull request Apr 19, 2026
Re-enables per-job uv venv (disabled in #23119) by pairing it with two
changes that make deep_gemm's bf16 JIT cache work across the per-job venv:

- UV_VENV="/tmp/sglang-ci-venv" — stable path so library_root (and the
  resulting cache-key hash) is identical across every job and container.
- export DG_JIT_CACHE_DIR=/root/.cache/deep_gemm/bf16_jit_cache —
  redirects deep_gemm's bf16 cache out of /root/.deep_gemm/ (container
  writable layer) into the already-host-mounted deep_gemm subdir, so all
  containers on a host share compiled kernels.

Both are independently required: neither alone makes cross-container
cache hits work. Verified on an H200 host with two separate containers:
first compile 2.0s, cross-container read 0.010s (~220x speedup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
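
The two settings, as named above (a sketch; surrounding install-script context assumed):

    # Stable venv path: keeps deep_gemm's library_root, and therefore the
    # cache-key hash derived from it, identical across jobs and containers.
    UV_VENV="/tmp/sglang-ci-venv"
    # Move the bf16 JIT cache out of the container's writable layer into the
    # host-mounted deep_gemm dir so all containers on a host share kernels.
    export DG_JIT_CACHE_DIR=/root/.cache/deep_gemm/bf16_jit_cache
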
alisonshao added a commit that referenced this pull request Apr 19, 2026
Re-enables per-job uv venv (disabled in #23119) by using a stable venv
path so deep_gemm's NVCC file cache remains reusable across jobs.

- USE_VENV default flipped to 1 (install script + pr-test.yml env).
- UV_VENV="/tmp/sglang-ci-venv" — stable across every job and container
  on a host. deep_gemm hashes library_root (the abspath of the deep_gemm
  package) into its cache key, so varying the venv path per job breaks
  cache reuse. Holding the path constant keeps the hash constant.
- rm -rf before `uv venv` handles a stale dir from a crashed prior job.
- Cleanup script fallback targets the stable path (glob no longer needed).

deep_gemm's cache dir itself is already host-mounted on every runner
(/root/.cache/deep_gemm/), so no additional mount or env-var change is
needed — the stable venv path alone restores cross-container cache
sharing. Verified on an H200 host: compile ~2s in one container,
cross-container read ~0.01s in a second container at the same stable
path.
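
A sketch of the venv-creation sequence this describes (assuming uv is on PATH; exact script lines may differ):

    UV_VENV="/tmp/sglang-ci-venv"   # stable path -> stable deep_gemm cache key
    rm -rf "$UV_VENV"               # clear a stale dir from a crashed prior job
    uv venv "$UV_VENV"              # fresh per-job environment
    source "$UV_VENV/bin/activate"  # later $PIP_CMD installs target the venv
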
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Apr 19, 2026
Port PR sgl-project#23136 (Yuhao Yang): cudaMemcpyBatchAsync lost its failIdx
parameter in CUDA 13, so the dlsym-based call was passing the stream
handle at the wrong slot and segfaulting inside cuMemcpyBatchAsync_v2.
Use driver_version at runtime to dispatch to either the CUDA 12 or
CUDA 13 signature.

With the segfault fixed, move the 7 hicache tests that were parked
under test/manual in PR sgl-project#23119 and subsequent cu13 flake sweeps back
into test/registered so they run in CI again:

- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

The TODO "move back after fixed" docstrings are stripped, and the
register_cuda_ci call that was dropped from the mooncake backend test
on its way to manual is restored.

Co-Authored-By: Yuhao Yang <yhyang201@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Apr 19, 2026
After the main `uv pip install -e python[...]` step, runners that carried
state from the pre-sgl-project#23119 (cu129) era keep `nvidia-cuda-runtime-cu12`
installed as an orphan (Required-by: empty) alongside the cu13 runtime.
Its libcudart.so.12 sits under `nvidia/cuda_runtime/lib/` while cu13's
lives under `nvidia/cu13/lib/`. Both dirs end up on LD_LIBRARY_PATH, so
cudnn_frontend_shim.h's probe

    for lib in ["libcudart.so.12", "libcudart.so.13"]:
        dlopen(lib)

loads both and throws:

    RuntimeError: Multiple libcudart libraries found:
    libcudart.so.12 and libcudart.so.13

Tests hit this during server setUpClass → CUDA graph capture (e.g.
test_nvfp4_gemm_sm120.py on stage-b-test-1-gpu-small). The same failure
reproduces on main, so this is not PR-specific — it's a leftover cleanup
step the cu13 migration missed.

Fix: uninstall nvidia-cuda-runtime-cu12 right after the main install.
Its install dir is disjoint from cu13's so the uninstall doesn't touch
any files shared with cu13 packages (a blunter sweep of all
`nvidia-*-cu12` breaks torch because several pairs share dirs under
`nvidia/<name>/lib/` and uninstalling one deletes files that the cu13
variant still references through its RECORD).

Reproduced and verified on 5090-novita-ci-runner-d (runner-1 container):

    before: libcudart.so.12 + libcudart.so.13 both loadable
    after : only libcudart.so.13 loadable, torch.cuda.randn works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
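
A sketch of that targeted uninstall, run right after the main install (plain pip syntax; with uv the equivalent is `uv pip uninstall nvidia-cuda-runtime-cu12`):

    # Remove only the orphaned cu12 runtime. Its install dir is disjoint from
    # the cu13 packages, so no shared files are deleted; a blanket
    # nvidia-*-cu12 sweep would break torch, as noted above.
    if pip show nvidia-cuda-runtime-cu12 >/dev/null 2>&1; then
        pip uninstall -y nvidia-cuda-runtime-cu12
    fi
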
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Apr 20, 2026
`modelopt_quant` and `modelopt_export_path` were removed from
ModelConfig.__init__ in sgl-project#10154 (replaced by unified `quantization`
flag and LoadConfig.modelopt_export_path), but the test was never
updated. It stayed latent because the class is skipped when
nvidia-modelopt isn't installed; sgl-project#23119 added the dep to the CI
image, which exposed the failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request Apr 22, 2026
Since sgl-project#23119 flipped the `sgl-kernel-build-wheels` matrix to
`cuda-version: "13.0"` only (in preparation for the CI torch upgrade),
any PR touching sgl-kernel is silently reverted on H-series (SM90)
runners whose test env torch is still cu129.

The failure mode is invisible on the surface:

  1. PR's `sgl-kernel-build-wheels` produces a cu130 wheel (artifact
     `wheel-python3.10-cuda13.0`).
  2. H-series test jobs download that wheel into `sgl-kernel/dist/`
     and `ci_install_dependency.sh` installs it.
  3. The script's "sgl-kernel +cuXYZ ≠ CU_VERSION" guard (correct in
     its intent -- a cu130 wheel is genuinely ABI-incompat with cu129
     torch) then reinstalls `sglang-kernel==<ver>` from the public
     Artifactory index -- replacing the PR's built wheel with the
     main-branch wheel that the runner is compatible with.
  4. Any sgl-kernel change in the PR (new kernel signatures, schema
     tweaks, etc.) is silently dropped. Python-side editable code keeps
     the PR's expectations -> `TypeError: unexpected keyword argument`
     at first call.

Example: PR sgl-project#21985 adds `out=` to `flash_attn_with_kvcache`. The
Python wrapper (editable) passes `out=`, but the reinstalled main
wheel's C++ op doesn't accept it -> TypeError on
`stage-c-test-8-gpu-h20`.

Fix:

1. Restore `cuda-version: "12.9"` as a second matrix entry in both the
   x86_64 and aarch64 `sgl-kernel-build-wheels` jobs, so every PR
   produces BOTH cu129 and cu130 wheels.
2. Change all test-job `download-artifact` patterns from
   `wheel-python3.10-cuda13.0` to `wheel-python3.10-cuda*` so both
   wheels land in `sgl-kernel/dist/` (`merge-multiple: true` already
   set).
3. In `ci_install_dependency.sh`, select the wheel matching
   `$CU_VERSION` by name (`+${CU_VERSION}`), falling back to the
   previous "any matching wheel" glob if a single-CUDA wheel is all
   that's present -- preserves pre-sgl-project#23119 behavior for branches that
   haven't picked up this change.

After this patch:
  - B200 (cu130) tests install the cu130 wheel, no reinstall.
  - H-series (cu129) tests install the cu129 wheel, no reinstall.
  - The public-index fallback only fires when the PR didn't build its
    own wheel (e.g. `/rerun-stage` without kernel rebuild), matching
    its original purpose.

Cost: one extra matrix job per PR that touches sgl-kernel (~10 min on
x86_64, ~10 min on aarch64). Net per-PR CI runtime change is positive
for sgl-kernel PRs (no more silently-passing tests that were really
running main's wheel) and zero for other PRs (matrix doesn't run when
there's nothing to build).
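
A sketch of the wheel-selection logic in fix step 3 (variable names assumed; the script's actual code may differ):

    # Prefer the wheel whose local version tag matches the runner's toolkit,
    # e.g. ...+cu129... on an H-series runner whose torch is cu129.
    WHEEL=$(ls sgl-kernel/dist/*"+${CU_VERSION}"*.whl 2>/dev/null | head -n1)
    # Fallback: any built wheel, preserving pre-#23119 behavior for branches
    # that only produce a single-CUDA artifact.
    if [ -z "$WHEEL" ]; then
        WHEEL=$(ls sgl-kernel/dist/*.whl 2>/dev/null | head -n1)
    fi
    [ -n "$WHEEL" ] && $PIP_CMD install "$WHEEL"
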
jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request Apr 22, 2026
Since sgl-project#23119 flipped the `sgl-kernel-build-wheels` matrix to
`cuda-version: "13.0"` only (in preparation for the CI torch upgrade),
any PR touching sgl-kernel is silently reverted on H-series (SM90)
runners whose test env torch is still cu129.

The failure mode is invisible on the surface:

  1. PR's `sgl-kernel-build-wheels` produces a cu130 wheel (artifact
     `wheel-python3.10-cuda13.0`).
  2. H-series test jobs download that wheel into `sgl-kernel/dist/`
     and `ci_install_dependency.sh` installs it.
  3. The script's "sgl-kernel +cuXYZ ≠ CU_VERSION" guard (correct in
     its intent -- a cu130 wheel is genuinely ABI-incompat with cu129
     torch) then reinstalls `sglang-kernel==<ver>` from the public
     Artifactory index -- replacing the PR's built wheel with the
     main-branch wheel that the runner is compatible with.
  4. Any sgl-kernel change in the PR (new kernel signatures, schema
     tweaks, etc.) is silently dropped. Python-side editable code keeps
     the PR's expectations -> `TypeError: unexpected keyword argument`
     at first call.

Example: PR sgl-project#21985 adds `out=` to `flash_attn_with_kvcache`. The
Python wrapper (editable) passes `out=`, but the reinstalled main
wheel's C++ op doesn't accept it -> TypeError on
`stage-c-test-8-gpu-h20`.

Fix:

1. Restore `cuda-version: "12.9"` as a second matrix entry in both the
   x86_64 and aarch64 `sgl-kernel-build-wheels` jobs, so every PR
   produces BOTH cu129 and cu130 wheels.
2. Change all test-job `download-artifact` patterns from
   `wheel-python3.10-cuda13.0` to `wheel-python3.10-cuda*` so both
   wheels land in `sgl-kernel/dist/` (`merge-multiple: true` already
   set).
3. In `ci_install_dependency.sh`, select the wheel matching
   `$CU_VERSION` by name (`+${CU_VERSION}`), falling back to the
   previous "any matching wheel" glob if a single-CUDA wheel is all
   that's present -- preserves pre-sgl-project#23119 behavior for branches that
   haven't picked up this change.

After this patch:
  - B200 (cu130) tests install the cu130 wheel, no reinstall.
  - H-series (cu129) tests install the cu129 wheel, no reinstall.
  - The public-index fallback only fires when the PR didn't build its
    own wheel (e.g. `/rerun-stage` without kernel rebuild), matching
    its original purpose.

Cost: one extra matrix job per PR that touches sgl-kernel (~10 min on
x86_64, ~10 min on aarch64). Net per-PR CI runtime change is positive
for sgl-kernel PRs (no more silently-passing tests that were really
running main's wheel) and zero for other PRs (matrix doesn't run when
there's nothing to build).
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
…gl-project#23119)

Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>
Kangyan-Zhou added a commit that referenced this pull request Apr 26, 2026
The 7 hicache tests below were moved from test/registered to test/manual
in PR #23119 (cu13 upgrade) and follow-up flake sweeps because they hit
the cudaMemcpyBatchAsync segfault on CUDA 13. That segfault is fixed in
sglang-kernel 0.4.1.post1 (this PR), so move the tests back into
test/registered:

- hicache/test_hicache_storage.py
- hicache/test_hicache_storage_3fs_backend.py
- hicache/test_hicache_storage_file_backend.py
- hicache/test_hicache_storage_mooncake_backend.py
- hicache/test_hicache_storage_runtime_attach_detach.py
- hicache/test_hicache_variants.py
- 4-gpu-models/test_qwen35_hicache.py

The TODO "move back after fixed" docstrings are stripped, and the
register_cuda_ci call that was dropped from the mooncake backend test
on its way to manual is restored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
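
A hypothetical sketch of the move-back (directory layout inferred from the paths above):

    # Move the seven hicache tests from test/manual back to test/registered.
    for t in hicache/test_hicache_storage.py \
             hicache/test_hicache_storage_3fs_backend.py \
             hicache/test_hicache_storage_file_backend.py \
             hicache/test_hicache_storage_mooncake_backend.py \
             hicache/test_hicache_storage_runtime_attach_detach.py \
             hicache/test_hicache_variants.py \
             4-gpu-models/test_qwen35_hicache.py; do
        git mv "test/manual/$t" "test/registered/$t"
    done
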
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
…gl-project#23119)

Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>

Labels

bypass-maintenance, dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), hicache (Hierarchical Caching for SGLang), high priority, jit-kernel, lora, multi-modal (multi-modal language model), quant (LLM Quantization), run-ci, sgl-kernel

Projects

None yet


4 participants