Studio: download paired cudart bundle on Windows CUDA installs by danielhanchen · Pull Request #5322 · unslothai/unsloth

danielhanchen · 2026-05-07T10:38:39Z

Summary

Fixes the Windows half of #5106 (multiple users reporting "GPU detected but model loads entirely on RAM/CPU"). Studio's installer was downloading only the main llama-<tag>-bin-win-cuda-X.Y-x64.zip archive, but upstream ggml-org/llama.cpp ships Windows CUDA in two archives that the release notes explicitly require both of:

llama-<tag>-bin-win-cuda-X.Y-x64.zip (binaries + ggml DLLs)
cudart-llama-bin-win-cuda-X.Y-x64.zip (cudart64_X.dll, cublas64_X.dll, cublasLt64_X.dll)

The AssetChoice.runtime_name / runtime_url fields existed but were never populated, and install_from_archives only handled choice.url. With cudart DLLs missing from install_dir/build/bin/Release, the prebuilt's LoadLibrary calls only resolved when the user happened to have a version-matched system CUDA toolkit on PATH. That is why nvidia-smi reported the GPU and Studio's nvidia-smi probe reported free VRAM, but llama-server's CUDA backend silently failed to initialize and the model loaded entirely on CPU regardless of -ngl or --fit on.

What changed

windows_cuda_attempts and published_windows_cuda_attempts now look up the paired cudart-llama-bin-win-cuda-X.Y-x64.zip URL alongside the main archive and store it as runtime_url / runtime_name on the AssetChoice. Pairing only happens when the selected main archive is the binary archive (name starts with llama-); the legacy cudart-only naming path stays intact.
apply_approved_hashes resolves the runtime archive's hash from the approved manifest. If the manifest does not list the runtime archive, the pairing is dropped rather than installing an unverified runtime. Published bundles keep their supply-chain guarantee; upstream installs without a manifest sit on the same risk surface as the existing main-archive download.
install_from_archives downloads the runtime archive into its own temp dir and runs copy_globs against both source dirs. Separate dirs avoid the "ambiguous archive layout" guard tripping on shared filenames like LICENSE.txt, while the second copy_globs overlay drops the cudart DLLs alongside llama-server.exe in install_dir/build/bin/Release.
Added a runtime_sha256 field on AssetChoice to carry the verified hash through to the download step.

Test plan

python -m pytest tests/studio/install/test_selection_logic.py (130 passed: 125 baseline + 5 new)
python -m pytest tests/studio/install/test_install_llama_prebuilt_logic.py tests/studio/install/test_pr4562_bugfixes.py tests/studio/install/test_llama_pr_force_and_source.py (all green)
Validate on a Windows host without a system CUDA toolkit: fresh install picks up cudart DLLs from the release and nvidia-smi shows VRAM usage during inference
Validate on a Windows host that previously hit [Bug] Not Unsloth Studio detects my GPU But not use but only uses CPU/RAM #5106 (driver CUDA 13.2, no toolkit): no longer needs winget install Nvidia.CUDA

New tests added in tests/studio/install/test_selection_logic.py:

test_cudart_runtime_archive_is_paired -- upstream attempts populate runtime_url
test_no_runtime_archive_when_cudart_absent -- graceful degrade when cudart asset is missing in the release
test_cudart_only_assets_do_not_self_pair -- legacy cudart-only naming path stays unchanged
test_runtime_hash_threaded_when_present -- apply_approved_hashes resolves runtime_sha256 from the manifest
test_runtime_pair_dropped_when_hash_missing -- pairing is dropped when the manifest lacks the runtime hash

Refs #5106

Upstream ggml-org/llama.cpp publishes Windows CUDA in two archives that the release notes explicitly say are both required: llama-<tag>-bin-win-cuda-X.Y-x64.zip (binaries + ggml DLLs) cudart-llama-bin-win-cuda-X.Y-x64.zip (cudart64, cublas64, cublasLt64) Studio's installer was downloading only the first one. The ``runtime_name`` / ``runtime_url`` fields on AssetChoice existed but were never populated, and ``install_from_archives`` only handled ``choice.url``. With the cudart DLLs missing from ``install_dir/build/bin/Release``, the prebuilt binary's LoadLibrary calls only resolved at runtime when the user happened to have a version-matched system CUDA toolkit on PATH. That is the underlying cause for the Windows reports in #5106 ("GPU detected but model loaded entirely on RAM"): the prebuilt's CUDA backend silently fails to load and llama-server falls back to CPU regardless of ``-ngl`` or ``--fit on``. Wires the pairing through end to end: * ``windows_cuda_attempts`` and ``published_windows_cuda_attempts`` look up the matching ``cudart-llama-bin-win-cuda-X.Y-x64.zip`` asset URL alongside the main archive and store it as ``runtime_url`` / ``runtime_name`` on the AssetChoice. We only pair when the selected main archive is the binary archive (``llama-...zip``) so the legacy cudart-only naming path is unaffected. * ``apply_approved_hashes`` resolves the runtime archive's hash from the approved manifest. If the manifest does not list the runtime archive, the pairing is dropped rather than installing without checksum coverage. Preserves the supply-chain guarantee for published bundles; upstream installs with no manifest are unaffected (same risk surface as the existing main-archive download). * ``install_from_archives`` now downloads the runtime archive into a separate temp dir and runs ``copy_globs`` against both source dirs. Separate dirs avoid the "ambiguous archive layout" guard tripping on shared filenames like LICENSE.txt, while the second ``copy_globs`` overlay drops the cudart DLLs into the same ``install_dir/build/bin/Release`` directory as the main binary. Adds a ``runtime_sha256`` field on AssetChoice to carry the verified hash through to the download step, alongside the existing ``runtime_name`` / ``runtime_url`` slots. Tests: 5 new cases in tests/studio/install/test_selection_logic.py: * upstream pairing populates runtime_url / runtime_name * graceful degrade when cudart asset is absent in the release * legacy cudart-only naming path does not self-pair * apply_approved_hashes threads runtime_sha256 when the manifest lists it * apply_approved_hashes drops the pair when the runtime hash is missing rather than installing without verification 130 install tests pass (125 baseline + 5 new). No regressions. Refs #5106

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b2aeed578

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-07T10:41:05Z

    runtime_name: str | None = None
    runtime_url: str | None = None
+    runtime_sha256: str | None = None


Include runtime bundle in install cache validation

When a Windows CUDA choice now has a paired runtime archive, an existing install made before this change can still be considered up to date because expected_install_fingerprint()/expected_pairs only key off the main asset and runtime_line, while runtime_payload_health_groups() for windows-cuda only checks llama.dll and ggml-cuda.dll. In that scenario Studio skips the reinstall before install_from_archives() ever downloads cudart-llama...zip, leaving exactly the broken installs without cudart/cublas DLLs unaffected; include the runtime archive/hash in the fingerprint or explicitly require the paired DLLs so affected installs are refreshed.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request implements support for paired runtime archives, specifically for Windows CUDA prebuilts, to ensure necessary DLLs like cudart and cublas are included during installation. The changes modify the AssetChoice structure, update asset selection logic to pair binary archives with their corresponding runtime bundles, and enhance the installation process to download and overlay these archives with checksum verification. New tests have been added to validate the pairing logic and hash verification. I have no feedback to provide.

expected_install_fingerprint did not hash the new runtime_name / runtime_sha256 fields, and runtime_payload_health_groups for windows- cuda only checked llama.dll / ggml-cuda.dll. The combination meant that an install made before this PR -- the exact installs reporting #5106 -- would still match the post-PR choice: same main asset name + sha, same llama.dll, same ggml-cuda.dll, missing cudart64_*.dll, but existing_install_matches_choice returned True and the cudart download path in install_from_archives never ran. Fresh installs got the fix; existing affected installs did not. This commit: * Adds runtime_asset and runtime_sha256 to the fingerprint payload so any change to (or first introduction of) the cudart pair invalidates pre-existing installs. * Refactors write_prebuilt_metadata to call expected_install_fingerprint so the recorded fingerprint cannot drift from the expected one when new keys are added. * Extends runtime_payload_health_groups for windows-cuda to require cudart64_*.dll and cublas64_*.dll *only when the choice carries a paired runtime archive*. Gating on choice.runtime_name keeps the no-pair fallback path (manifest missing cudart hash, upstream without paired bundle) from looping on reinstall. New tests: * test_existing_install_matches_plan_windows_cuda_paired_requires_cudart -- paired choice rejects installs missing cudart / cublas. * test_existing_install_matches_plan_windows_cuda_unpaired_skips_cudart_check -- unpaired choice still accepts legacy cudart-less installs. * test_existing_install_fingerprint_changes_when_cudart_pair_added -- direct fingerprint mismatch between the legacy and paired choice. Refs #5106

danielhanchen · 2026-05-11T10:56:19Z

Pushed 526894a4 to address the install-cache gap (matches the codex bot P1 comment at install_llama_prebuilt.py:213).

The mechanism in the PR was correct end-to-end -- windows_cuda_upstream_asset_names orders the binary archive first so the selected_name.startswith("llama-") guard pairs only real binary archives, overlay_directory_for_choice puts the runtime overlay in install_dir/build/bin/Release next to llama-server.exe, runtime_patterns_for_choice["windows-cuda"] is ["*.exe", "*.dll"] which matches the cudart DLLs, and the separate runtime_extract_dir correctly sidesteps the ambiguous-layout guard on LICENSE.txt. The supply-chain story (apply_approved_hashes drops the pair when the manifest lacks the runtime sha) is also fine.

The remaining hole was that an existing pre-PR install would still match the post-PR choice and skip the reinstall that drops the cudart DLLs in:

expected_install_fingerprint did not hash runtime_name / runtime_sha256, so the fingerprint did not change when the choice gained a cudart pair.
runtime_payload_health_groups["windows-cuda"] only checked llama.dll / ggml-cuda.dll, so a directory that already contained those two DLLs (but no cudart) still passed the health check.

Combined, existing_install_matches_choice returned True for the very installs that filed this issue, so the cudart overlay never ran.

What the commit changes:

expected_install_fingerprint now hashes runtime_asset (choice.runtime_name) and runtime_sha256. Pre-PR installs hash both as None; post-PR paired installs hash the cudart sha. The fingerprint flips and the install is refreshed.
write_prebuilt_metadata now calls expected_install_fingerprint directly instead of inlining a parallel payload, so the recorded and the expected fingerprint cannot drift again when new fields are added.
runtime_payload_health_groups["windows-cuda"] adds cudart64_*.dll and cublas64_*.dll requirements only when choice.runtime_name is set. Gating on the choice keeps the no-pair fallback (manifest missing the cudart hash, upstream without a paired bundle) from looping on reinstall forever.

New tests in tests/studio/install/test_install_llama_prebuilt_logic.py:

test_existing_install_matches_plan_windows_cuda_paired_requires_cudart -- paired choice rejects installs missing cudart / cublas.
test_existing_install_matches_plan_windows_cuda_unpaired_skips_cudart_check -- unpaired choice still accepts legacy cudart-less installs (no reinstall loop).
test_existing_install_fingerprint_changes_when_cudart_pair_added -- direct fingerprint comparison between the legacy and paired choice.

Full suite stays green: tests/studio/install/ 260 passed.

Minor nit (not addressed): selected_name.startswith(f"llama-") at install_llama_prebuilt.py:2935 has a pointless f-prefix; the other site at :3023 uses plain "llama-". Trivial cleanup, your call.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90431b9982

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T11:01:31Z

+            groups.append(["cudart64_*.dll"])
+            groups.append(["cublas64_*.dll"])


Require cublasLt DLL for paired Windows CUDA installs

When runtime_name is set, the health check only requires cudart64_*.dll and cublas64_*.dll, but not cublasLt64_*.dll even though the paired cudart bundle includes it. A partially deleted/corrupted install that is missing only cublasLt will still be treated as up-to-date and skip reinstall, which can leave CUDA backend initialization broken at runtime despite this commit’s paired-runtime validation path.

Useful? React with 👍 / 👎.

Three follow-ups from a 12-reviewer batch over 526894a (PR #5322): 1. (12/12) Health check required cudart64_*.dll and cublas64_*.dll but not cublasLt64_*.dll. The upstream cudart-llama-bin-win-cuda-X.Y-x64 bundle ships all three (verified against b9103 cuda-12.4 and cuda-13.1: 3 DLLs, no executables), and a Windows install missing any one of them still fails CUDA initialisation. Adding cublasLt64_*.dll to runtime_payload_health_groups so a partial install or a deletion of the third DLL triggers reinstall instead of silently staying broken. 2. The runtime overlay copy used the same broad runtime_patterns_for_choice set as the main archive (windows-cuda returns *.exe and *.dll). A malformed runtime zip that contained a llama-server.exe alongside the real cudart DLLs would have overwritten the main archive's server binary. Introduced paired_runtime_dll_patterns() that returns the cudart bundle's three specific filename patterns and nothing else, and use that for the second copy_globs pass. New end-to-end regression test packs a fake runtime zip with an extra llama-server.exe and asserts the main binary survives. 3. (7/12) python_runtime_dirs in install_llama_prebuilt.py and _windows_pip_nvidia_dll_dirs in llama_cpp.py walked different path sets. The installer side missed nvidia/<pkg>/Library/bin (conda layout) and nvidia/<pkg>/bin/x86_64 (current CUDA 13 unsuffixed wheel layout), so preflight CUDA detection could fail even when usable DLLs were present. Mirrored the same six-path set the backend resolver uses, including arch subdirs. New tests: - test_paired_runtime_dll_patterns_excludes_executables - test_runtime_overlay_cannot_overwrite_main_archive_payload (end-to-end) - test_python_runtime_dirs_covers_cu13_and_library_bin - extended test_existing_install_matches_plan_windows_cuda_paired_requires_cudart with a cublasLt-missing case Upstream cudart bundle contents verified empirically by downloading the b9103 release artifacts directly: each cuda-X.Y bundle contains exactly cudart64_X.dll + cublas64_X.dll + cublasLt64_X.dll, no exes. Refs #5106

for more information, see https://pre-commit.ci

danielhanchen · 2026-05-11T12:09:20Z

Pushed 8b1288f7 to address the three real findings from a 12-reviewer batch over 526894a4:

cublasLt64_*.dll missing from health check (12/12 reviewers agreed). The upstream cudart-llama-bin-win-cuda-X.Y-x64.zip bundle ships exactly three DLLs -- verified by downloading b9103's cuda-12.4 and cuda-13.1 artifacts directly: cudart64_X.dll + cublas64_X.dll + cublasLt64_X.dll, no executables. A user missing only cublasLt64_*.dll (partial deletion, antivirus quarantine, etc) still fails CUDA initialisation; the health check now requires all three when choice.runtime_name is set.
Runtime overlay overly broad. The second copy_globs pass used runtime_patterns_for_choice(choice) which for windows-cuda is ["*.exe", "*.dll"]. A hypothetical malformed cudart zip containing llama-server.exe alongside the cudart DLLs would overwrite the main archive's server binary. Replaced with a new paired_runtime_dll_patterns(choice) helper that returns exactly the cudart bundle's three filename patterns. End-to-end test packs a runtime zip with a fake llama-server.exe and asserts the main binary survives.
Installer / backend resolver asymmetry (7/12 + a related cu13 finding). python_runtime_dirs in install_llama_prebuilt.py walked nvidia/*/lib, nvidia/*/bin, torch/lib, while the backend's _windows_pip_nvidia_dll_dirs (from Studio: add torch's pip nvidia DLL dirs to PATH on Windows #5324) walks the broader Windows wheel layouts. Mirrored the six-path set so installer preflight CUDA detection now also covers:
- nvidia/<pkg>/bin/x86_64 and nvidia/<pkg>/bin/x64 (current CUDA 13 layout used by the unsuffixed nvidia-cuda-runtime 13.x and nvidia-cublas 13.x packages; verified empirically by pip download nvidia-cuda-runtime --platform win_amd64 which produces nvidia/cu13/bin/x86_64/cudart64_13.dll).
- nvidia/<pkg>/Library/bin (conda-style repacks) and its arch subdirs.

The lower-priority reviewer findings (P1 version-agnostic glob, P3 contradictory log line) were assessed as low-probability / cosmetic; deferred for a follow-up PR if you want them.

New tests:

test_paired_runtime_dll_patterns_excludes_executables -- gate on the helper.
test_runtime_overlay_cannot_overwrite_main_archive_payload -- end-to-end on install_from_archives with a malformed runtime zip.
test_python_runtime_dirs_covers_cu13_and_library_bin -- installer-side parity with the backend resolver.
Extended test_existing_install_matches_plan_windows_cuda_paired_requires_cudart with the cublasLt64_*.dll-missing case.

Regression: 263 install-suite tests pass (was 260 before this commit). Behavioural simulation harness in temp/pr_simulation/ exercises 35 scenarios (combined fresh-install, existing-install, fingerprint, health, PATH assembly, ZIP traversal, install-lock, resolver edge cases, cu13 layout, glob-meta safety, narrow-overlay regression) and all 35 pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7447ad558b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T12:12:28Z

+        "runtime_asset": choice.runtime_name,
+        "runtime_sha256": choice.runtime_sha256,


Limit runtime fingerprint fields to paired CUDA installs

expected_install_fingerprint now always includes runtime_asset and runtime_sha256, even when a choice has no paired runtime archive. Because metadata written before this commit used a fingerprint payload without those keys, every pre-existing install (including Linux/macOS and unpaired Windows bundles) will fail recorded_fingerprint != expected_fingerprint and be treated as stale, forcing unnecessary reinstall/validation work and avoidable network dependence. This cache invalidation should be scoped to paired Windows CUDA choices only.

Useful? React with 👍 / 👎.

* Studio: pin GPU at 95% headroom and warn on silent CPU fallback Two related runtime-side fixes for #5106 ("model loaded fully on RAM instead of VRAM"): 1. GPU pin threshold bump 0.90 -> 0.95 ------------------------------------- ``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server`` used a ``pool * 0.90`` threshold to decide whether the model fits on GPU. Models that needed 91-94% of free VRAM were classified as "does not fit", so Studio set ``gpu_indices = None`` and shipped ``--fit on`` to llama-server without ``-ngl``. The unsloth llama.cpp fork's ``--fit on`` then ran with its default ``--fit-target 1024`` (1 GiB margin per device, an upstream default inherited from ggml-org#18679). On a tight fit where compute buffers + CUDA context push the projected free below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU -- slow inference for users whose models would have loaded comfortably with ``-ngl -1``. The classic reproducer from #5106 (noahterbest's log): GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096, GPUs free: [(0, 22805)], selected: None, fit: True 20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. Bumping to 0.95 keeps these in the fits-on-GPU branch and emits ``-ngl -1`` directly. The fork's ``--fit on`` still serves as the safety net for the genuinely-too-large case. The auto-ctx fallback also re-checks fit at 4096 before handing off to ``--fit on``: a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx, falls back to ``min(4096, ctx)``, but its weights + 4096 KV pin to the GPU comfortably. Without the re-check we still emitted ``--fit on``. ``_fit_context_to_vram``'s 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. ``_select_gpus`` decides whether to pin at all, where being conservative pushes layers to CPU. 2. Belt-and-suspenders: warn on silent CPU fallback --------------------------------------------------- After ``_wait_for_health`` succeeds, scan llama-server's stdout for ``model buffer size`` lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning citing #5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]`` field surfaces the result for any future API consumer. This catches runtime-load failures the install-time fix cannot cover (cudart bundle pairing PR #5322 is the install-side companion): user overriding ``--fit-target``, uncommon driver + toolkit configurations, future regressions in the install path. Tests: 10 new cases in studio/backend/tests/test_llama_cpp_context_fit.py: * TestTightFitPinsToGPU x3: noahterbest's exact reproducer (auto and explicit ctx pins to GPU at 94%); guard against threshold over- broadening (genuine overflow still falls back to ``--fit on``). * TestClassifyGpuOffload x7: CUDA / ROCm / Metal buffer markers return True; CPU-only buffer lines return False; absent buffer lines or no GPUs detected return None (no warning). 25 context-fit tests pass (15 baseline + 10 new). 511 tests total across the affected test files. No regressions. Refs #5106 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Trim comments to be more succinct --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…tic CI test (#5376) * tests/studio: end-to-end Windows GPU detection mock test (#5106) Locks in the combined fix from #5322 + #5324 with a synthetic Windows scenario that CI runners without GPUs can execute. The test packs the real PyPI win_amd64 wheel layouts (cu12 modular and the new unsuffixed cu13 nvidia/cu13/bin/x86_64 layout) plus the exact filename set of the upstream b9103 cudart-llama-bin-win-cuda bundles, then mocks nvidia-smi output and asserts that: * Studio's nvidia-smi probe parses the CSV and reports the GPU. * After PR #5322 the install_dir/build/bin/Release/ tree contains all three cudart bundle DLLs alongside llama-server.exe. * After PR #5324 the PATH built by start_llama_server's win32 branch lists pip nvidia + torch/lib dirs in addition to the binary_dir. * cudart64_X.dll, cublas64_X.dll, and cublasLt64_X.dll are each reachable from at least one PATH entry, with cudart specifically reachable from BOTH the install dir and a pip nvidia dir (defence in depth). * Bare venvs without pip nvidia wheels still work via #5322's binary_dir drop; pre-#5322 installs still work via #5324's PATH augmentation. * A reconstructed pre-PR scenario (cudart absent from binary_dir and pip dirs not on PATH) leaves cudart unreachable, confirming the test would catch a future regression. Bonus housekeeping in studio/install_llama_prebuilt.py: drop the pointless f-prefix on the literal "llama-" in the windows_cuda_attempts pairing guard (no behaviour change; lint nit flagged in the post-merge review). The mocks model real artifact contents I verified empirically: * pip download nvidia-cuda-runtime --platform win_amd64 produces nvidia/cu13/bin/x86_64/cudart64_13.dll. * unzip on the b9103 cudart-llama-bin-win-cuda-13.1-x64.zip produces exactly cudart64_13.dll + cublas64_13.dll + cublasLt64_13.dll, no executables. * objdump -p on the b9103 ggml-cuda.dll shows a static PE import on cublas64_13.dll (the root cause of #5106 when cublas64_13.dll is unreachable). Refs #5106 #5322 #5324 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test_5106_windows_gpu_detection_mock: don't shadow real httpx This file's name sorts before every other file in studio/backend/tests/ (starts with the digit '5'), so pytest collects it first. The previous ``sys.modules.setdefault("httpx", _httpx_stub)`` ran before any other test imported real httpx, which meant the stub permanently shadowed the real module for the rest of the collection. Tests that did ``from httpx import HTTPError, Response`` (test_anthropic_messages, test_browse_folders_route, test_training_*, etc) then failed at collection with ``ImportError: cannot import name 'HTTPError'`` because the stub did not define those names. The existing test_llama_cpp_windows_nvidia_path.py did not trigger the same issue because it sorts after test_a* / test_b* / etc, by which point the real httpx has already been imported and setdefault is a no-op. Switch the stub installation to ``importlib.util.find_spec(name) is None`` so we only fall back to the stub when the real module truly is not installed. Backend CI installs httpx, structlog, and the studio/backend/loggers package is reachable via the sys.path augmentation a few lines above, so on CI all three find_spec calls succeed and no stubs are installed at all. Also add HTTPError and Response to the stub module for the offline case, so anyone running this test outside CI with httpx absent still gets a stub that satisfies the broader test suite's imports. Refs #5106 * test_5106 + llama_cpp: extract win32 PATH helper and harden the regression test Follow-up to PR #5376's review feedback. Three real findings from the bot reviewers, plus one stale one. 1. (codex P2 line 201, gemini medium line 209) The regression test's _build_path_dirs_like_start_llama_server hand-copied the win32 branch of LlamaCppBackend.start_llama_server, so a future drop or reorder of _windows_pip_nvidia_dll_dirs(sys.prefix) in production would have passed the test silently. Extract a new staticmethod LlamaCppBackend._build_windows_path_dirs (binary_dir, prefix, cuda_path). Production start_llama_server now calls this helper. The test's wrapper is reduced to a one-line delegate that forwards to the staticmethod, so the regression asserts against the exact production logic instead of a parallel copy of it. 2. (codex P2 line 245) test_nvidia_smi_probe_reports_synthetic_gpu did not clear CUDA_VISIBLE_DEVICES. On a shared GPU runner with the variable set in the parent shell, _get_gpu_free_memory() filters the mocked CSV and returns [] or falls through to the torch fallback. Cleared CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES via monkeypatch.delenv(..., raising=False). 3. (codex P2 line 66) _maybe_stub gated on importlib.util.find_spec ("loggers"), which returns a spec because studio/backend/loggers/ is on sys.path. But the actual import chain loads loggers/handlers.py which does `from fastapi import Request, Response` at module load. In a lightweight env without fastapi installed, the stub never lands and `from core.inference.llama_cpp import LlamaCppBackend` raises during collection. Switched _maybe_stub to a real import attempt under try / except ImportError so the stub falls into place when the package is discoverable but not importable. CI has fastapi so this is purely a developer- machine ergonomics fix. The fourth comment (codex P1 line 85 "Keep the httpx stub from leaking across tests") was already addressed by 7437e73, which replaced the unconditional sys.modules.setdefault with the find_spec-gated _maybe_stub. No code change needed. Production behaviour is unchanged: _build_windows_path_dirs returns exactly the same ordering start_llama_server used inline ([binary_dir, *pip_dirs, cuda_bin?, cuda_bin_x64?]). Verification (run inside studio/backend): pytest tests/test_5106_windows_gpu_detection_mock.py -v -> 10 passed pytest tests/test_llama_cpp_*.py tests/test_llama_server_args.py tests/test_5106_windows_gpu_detection_mock.py -q -> 171 passed CUDA_VISIBLE_DEVICES=1 pytest tests/test_5106_windows_gpu_detection_mock.py::TestWindowsGpuDetectionAfter5106Fix::test_nvidia_smi_probe_reports_synthetic_gpu -> 1 passed * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename Windows GPU detection test to a generic filename and trim comments - studio/backend/tests/test_5106_windows_gpu_detection_mock.py -> studio/backend/tests/test_windows_gpu_detection_mock.py The file is the generic regression suite for Windows GPU detection; encoding the issue number in the filename is noise. - Shorten module docstring, helper docstrings, per-test docstrings and inline comments in the renamed test file. No behaviour change, all 10 cases still pass. - Shorten the _build_windows_path_dirs docstring in studio/backend/core/inference/llama_cpp.py and update the test-path reference; trim the win32 call-site comment to one line. Local verification: - pytest studio/backend/tests/test_windows_gpu_detection_mock.py -- 10 passed. - pytest studio/backend/tests/test_llama_cpp_windows_nvidia_path.py studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_windows_gpu_detection_mock.py -- 110 passed. * Studio: harden _wait_for_health against transient httpx ReadError The probe loop in LlamaCppBackend._wait_for_health only caught ConnectError and TimeoutException. On Windows, when llama-server.exe accepts the TCP probe and then dies before sending HTTP headers, the peer process RST closes the socket. httpx maps this to ReadError ("WinError 10054 -- An existing connection was forcibly closed by the remote host"), which fell through the except clause and bubbled out of _wait_for_health, the routes/inference.py load_model handler, and back to /api/inference/load as an opaque 500. The crash diagnostic Studio actually wants to surface lives on the self._process.poll() branch at the top of the loop body: "llama-server exited with code X. Output: ...". We never reached that branch on the WinError 10054 path because the very first probe blew up. Expand the except to also swallow ReadError and RemoteProtocolError so the next 0.5-second iteration runs the poll() branch. Outcomes: * Process really died: structured exit-code + last-stdout log line. * Single transient probe blip: silently retried; load succeeds. Adds studio/backend/tests/test_llama_cpp_wait_for_health.py with five cases covering happy-path 200, transient ReadError + dead process, RemoteProtocolError + dead process, ConnectError cycling until success, and dead process before the first probe. The new cases would have failed against the old except clause -- ReadError / RemoteProtocolError would have propagated instead of returning False. Found while triaging the Windows Studio GGUF CI flake on this PR's 5a6ddc3 push: llama-server.exe (b9203 prebuilt) crashed within 2.2 s of launch on the GPU-less runner, and Studio reported "WinError 10054" instead of an upstream-tag-attributable exit-code line. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: danielhanchen <michaelhan2050@gmail.com>

danielhanchen requested a review from rolandtannous as a code owner May 7, 2026 10:38

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b2aeed

for more information, see https://pre-commit.ci

danielhanchen mentioned this pull request May 7, 2026

Studio: pin GPU at 95% headroom and warn on silent CPU fallback #5323

Merged

5 tasks

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

Trim comments to be more succinct

d4f3752

danielhanchen added 2 commits May 11, 2026 03:20

Merge branch 'main' into dh/fix-5106-windows-cudart-pair

cf7179a

[pre-commit.ci] auto fixes from pre-commit.com hooks

90431b9

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

danielhanchen and others added 2 commits May 11, 2026 12:08

[pre-commit.ci] auto fixes from pre-commit.com hooks

7447ad5

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

danielhanchen merged commit e346193 into main May 11, 2026
33 of 34 checks passed

danielhanchen deleted the dh/fix-5106-windows-cudart-pair branch May 11, 2026 12:42

danielhanchen mentioned this pull request May 11, 2026

tests/studio: lock in Windows GPU detection fix (#5106) with a synthetic CI test #5376

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Studio: download paired cudart bundle on Windows CUDA installs#5322

Studio: download paired cudart bundle on Windows CUDA installs#5322
danielhanchen merged 8 commits into
mainfrom
dh/fix-5106-windows-cudart-pair

danielhanchen commented May 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

danielhanchen commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Uh oh!

danielhanchen commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		groups.append(["cudart64_*.dll"])
		groups.append(["cublas64_*.dll"])

		"runtime_asset": choice.runtime_name,
		"runtime_sha256": choice.runtime_sha256,

Uh oh!

Conversation

danielhanchen commented May 7, 2026

Summary

What changed

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

danielhanchen commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant