Skip to content

Studio: download paired cudart bundle on Windows CUDA installs#5322

Merged
danielhanchen merged 8 commits into
mainfrom
dh/fix-5106-windows-cudart-pair
May 11, 2026
Merged

Studio: download paired cudart bundle on Windows CUDA installs#5322
danielhanchen merged 8 commits into
mainfrom
dh/fix-5106-windows-cudart-pair

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Fixes the Windows half of #5106 (multiple users reporting "GPU detected but model loads entirely on RAM/CPU"). Studio's installer was downloading only the main llama-<tag>-bin-win-cuda-X.Y-x64.zip archive, but upstream ggml-org/llama.cpp ships Windows CUDA in two archives that the release notes explicitly require both of:

  • llama-<tag>-bin-win-cuda-X.Y-x64.zip (binaries + ggml DLLs)
  • cudart-llama-bin-win-cuda-X.Y-x64.zip (cudart64_X.dll, cublas64_X.dll, cublasLt64_X.dll)

The AssetChoice.runtime_name / runtime_url fields existed but were never populated, and install_from_archives only handled choice.url. With cudart DLLs missing from install_dir/build/bin/Release, the prebuilt's LoadLibrary calls only resolved when the user happened to have a version-matched system CUDA toolkit on PATH. That is why nvidia-smi reported the GPU and Studio's nvidia-smi probe reported free VRAM, but llama-server's CUDA backend silently failed to initialize and the model loaded entirely on CPU regardless of -ngl or --fit on.

What changed

  • windows_cuda_attempts and published_windows_cuda_attempts now look up the paired cudart-llama-bin-win-cuda-X.Y-x64.zip URL alongside the main archive and store it as runtime_url / runtime_name on the AssetChoice. Pairing only happens when the selected main archive is the binary archive (name starts with llama-); the legacy cudart-only naming path stays intact.
  • apply_approved_hashes resolves the runtime archive's hash from the approved manifest. If the manifest does not list the runtime archive, the pairing is dropped rather than installing an unverified runtime. Published bundles keep their supply-chain guarantee; upstream installs without a manifest sit on the same risk surface as the existing main-archive download.
  • install_from_archives downloads the runtime archive into its own temp dir and runs copy_globs against both source dirs. Separate dirs avoid the "ambiguous archive layout" guard tripping on shared filenames like LICENSE.txt, while the second copy_globs overlay drops the cudart DLLs alongside llama-server.exe in install_dir/build/bin/Release.
  • Added a runtime_sha256 field on AssetChoice to carry the verified hash through to the download step.

Test plan

  • python -m pytest tests/studio/install/test_selection_logic.py (130 passed: 125 baseline + 5 new)
  • python -m pytest tests/studio/install/test_install_llama_prebuilt_logic.py tests/studio/install/test_pr4562_bugfixes.py tests/studio/install/test_llama_pr_force_and_source.py (all green)
  • Validate on a Windows host without a system CUDA toolkit: fresh install picks up cudart DLLs from the release and nvidia-smi shows VRAM usage during inference
  • Validate on a Windows host that previously hit [Bug] Not Unsloth Studio detects my GPU But not use but only uses CPU/RAM #5106 (driver CUDA 13.2, no toolkit): no longer needs winget install Nvidia.CUDA

New tests added in tests/studio/install/test_selection_logic.py:

  • test_cudart_runtime_archive_is_paired -- upstream attempts populate runtime_url
  • test_no_runtime_archive_when_cudart_absent -- graceful degrade when cudart asset is missing in the release
  • test_cudart_only_assets_do_not_self_pair -- legacy cudart-only naming path stays unchanged
  • test_runtime_hash_threaded_when_present -- apply_approved_hashes resolves runtime_sha256 from the manifest
  • test_runtime_pair_dropped_when_hash_missing -- pairing is dropped when the manifest lacks the runtime hash

Refs #5106

Upstream ggml-org/llama.cpp publishes Windows CUDA in two archives
that the release notes explicitly say are both required:

  llama-<tag>-bin-win-cuda-X.Y-x64.zip       (binaries + ggml DLLs)
  cudart-llama-bin-win-cuda-X.Y-x64.zip      (cudart64, cublas64, cublasLt64)

Studio's installer was downloading only the first one. The
``runtime_name`` / ``runtime_url`` fields on AssetChoice existed but
were never populated, and ``install_from_archives`` only handled
``choice.url``. With the cudart DLLs missing from
``install_dir/build/bin/Release``, the prebuilt binary's LoadLibrary
calls only resolved at runtime when the user happened to have a
version-matched system CUDA toolkit on PATH. That is the underlying
cause for the Windows reports in #5106 ("GPU detected but model
loaded entirely on RAM"): the prebuilt's CUDA backend silently fails
to load and llama-server falls back to CPU regardless of ``-ngl`` or
``--fit on``.

Wires the pairing through end to end:

* ``windows_cuda_attempts`` and ``published_windows_cuda_attempts``
  look up the matching ``cudart-llama-bin-win-cuda-X.Y-x64.zip``
  asset URL alongside the main archive and store it as
  ``runtime_url`` / ``runtime_name`` on the AssetChoice. We only
  pair when the selected main archive is the binary archive
  (``llama-...zip``) so the legacy cudart-only naming path is
  unaffected.

* ``apply_approved_hashes`` resolves the runtime archive's hash from
  the approved manifest. If the manifest does not list the runtime
  archive, the pairing is dropped rather than installing without
  checksum coverage. Preserves the supply-chain guarantee for
  published bundles; upstream installs with no manifest are
  unaffected (same risk surface as the existing main-archive
  download).

* ``install_from_archives`` now downloads the runtime archive into a
  separate temp dir and runs ``copy_globs`` against both source dirs.
  Separate dirs avoid the "ambiguous archive layout" guard tripping
  on shared filenames like LICENSE.txt, while the second
  ``copy_globs`` overlay drops the cudart DLLs into the same
  ``install_dir/build/bin/Release`` directory as the main binary.

Adds a ``runtime_sha256`` field on AssetChoice to carry the
verified hash through to the download step, alongside the existing
``runtime_name`` / ``runtime_url`` slots.

Tests: 5 new cases in tests/studio/install/test_selection_logic.py:
* upstream pairing populates runtime_url / runtime_name
* graceful degrade when cudart asset is absent in the release
* legacy cudart-only naming path does not self-pair
* apply_approved_hashes threads runtime_sha256 when the manifest
  lists it
* apply_approved_hashes drops the pair when the runtime hash is
  missing rather than installing without verification

130 install tests pass (125 baseline + 5 new). No regressions.

Refs #5106

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b2aeed578

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 217 to +219
runtime_name: str | None = None
runtime_url: str | None = None
runtime_sha256: str | None = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include runtime bundle in install cache validation

When a Windows CUDA choice now has a paired runtime archive, an existing install made before this change can still be considered up to date because expected_install_fingerprint()/expected_pairs only key off the main asset and runtime_line, while runtime_payload_health_groups() for windows-cuda only checks llama.dll and ggml-cuda.dll. In that scenario Studio skips the reinstall before install_from_archives() ever downloads cudart-llama...zip, leaving exactly the broken installs without cudart/cublas DLLs unaffected; include the runtime archive/hash in the fingerprint or explicitly require the paired DLLs so affected installs are refreshed.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for paired runtime archives, specifically for Windows CUDA prebuilts, to ensure necessary DLLs like cudart and cublas are included during installation. The changes modify the AssetChoice structure, update asset selection logic to pair binary archives with their corresponding runtime bundles, and enhance the installation process to download and overlay these archives with checksum verification. New tests have been added to validate the pairing logic and hash verification. I have no feedback to provide.

expected_install_fingerprint did not hash the new runtime_name /
runtime_sha256 fields, and runtime_payload_health_groups for windows-
cuda only checked llama.dll / ggml-cuda.dll. The combination meant that
an install made before this PR -- the exact installs reporting #5106 --
would still match the post-PR choice: same main asset name + sha, same
llama.dll, same ggml-cuda.dll, missing cudart64_*.dll, but
existing_install_matches_choice returned True and the cudart download
path in install_from_archives never ran. Fresh installs got the fix;
existing affected installs did not.

This commit:
 * Adds runtime_asset and runtime_sha256 to the fingerprint payload so
   any change to (or first introduction of) the cudart pair invalidates
   pre-existing installs.
 * Refactors write_prebuilt_metadata to call expected_install_fingerprint
   so the recorded fingerprint cannot drift from the expected one when
   new keys are added.
 * Extends runtime_payload_health_groups for windows-cuda to require
   cudart64_*.dll and cublas64_*.dll *only when the choice carries a
   paired runtime archive*. Gating on choice.runtime_name keeps the
   no-pair fallback path (manifest missing cudart hash, upstream
   without paired bundle) from looping on reinstall.

New tests:
 * test_existing_install_matches_plan_windows_cuda_paired_requires_cudart
   -- paired choice rejects installs missing cudart / cublas.
 * test_existing_install_matches_plan_windows_cuda_unpaired_skips_cudart_check
   -- unpaired choice still accepts legacy cudart-less installs.
 * test_existing_install_fingerprint_changes_when_cudart_pair_added
   -- direct fingerprint mismatch between the legacy and paired choice.

Refs #5106
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed 526894a4 to address the install-cache gap (matches the codex bot P1 comment at install_llama_prebuilt.py:213).

The mechanism in the PR was correct end-to-end -- windows_cuda_upstream_asset_names orders the binary archive first so the selected_name.startswith("llama-") guard pairs only real binary archives, overlay_directory_for_choice puts the runtime overlay in install_dir/build/bin/Release next to llama-server.exe, runtime_patterns_for_choice["windows-cuda"] is ["*.exe", "*.dll"] which matches the cudart DLLs, and the separate runtime_extract_dir correctly sidesteps the ambiguous-layout guard on LICENSE.txt. The supply-chain story (apply_approved_hashes drops the pair when the manifest lacks the runtime sha) is also fine.

The remaining hole was that an existing pre-PR install would still match the post-PR choice and skip the reinstall that drops the cudart DLLs in:

  • expected_install_fingerprint did not hash runtime_name / runtime_sha256, so the fingerprint did not change when the choice gained a cudart pair.
  • runtime_payload_health_groups["windows-cuda"] only checked llama.dll / ggml-cuda.dll, so a directory that already contained those two DLLs (but no cudart) still passed the health check.

Combined, existing_install_matches_choice returned True for the very installs that filed this issue, so the cudart overlay never ran.

What the commit changes:

  1. expected_install_fingerprint now hashes runtime_asset (choice.runtime_name) and runtime_sha256. Pre-PR installs hash both as None; post-PR paired installs hash the cudart sha. The fingerprint flips and the install is refreshed.
  2. write_prebuilt_metadata now calls expected_install_fingerprint directly instead of inlining a parallel payload, so the recorded and the expected fingerprint cannot drift again when new fields are added.
  3. runtime_payload_health_groups["windows-cuda"] adds cudart64_*.dll and cublas64_*.dll requirements only when choice.runtime_name is set. Gating on the choice keeps the no-pair fallback (manifest missing the cudart hash, upstream without a paired bundle) from looping on reinstall forever.

New tests in tests/studio/install/test_install_llama_prebuilt_logic.py:

  • test_existing_install_matches_plan_windows_cuda_paired_requires_cudart -- paired choice rejects installs missing cudart / cublas.
  • test_existing_install_matches_plan_windows_cuda_unpaired_skips_cudart_check -- unpaired choice still accepts legacy cudart-less installs (no reinstall loop).
  • test_existing_install_fingerprint_changes_when_cudart_pair_added -- direct fingerprint comparison between the legacy and paired choice.

Full suite stays green: tests/studio/install/ 260 passed.

Minor nit (not addressed): selected_name.startswith(f"llama-") at install_llama_prebuilt.py:2935 has a pointless f-prefix; the other site at :3023 uses plain "llama-". Trivial cleanup, your call.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90431b9982

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +5134 to +5135
groups.append(["cudart64_*.dll"])
groups.append(["cublas64_*.dll"])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require cublasLt DLL for paired Windows CUDA installs

When runtime_name is set, the health check only requires cudart64_*.dll and cublas64_*.dll, but not cublasLt64_*.dll even though the paired cudart bundle includes it. A partially deleted/corrupted install that is missing only cublasLt will still be treated as up-to-date and skip reinstall, which can leave CUDA backend initialization broken at runtime despite this commit’s paired-runtime validation path.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits May 11, 2026 12:08
Three follow-ups from a 12-reviewer batch over 526894a (PR #5322):

1. (12/12) Health check required cudart64_*.dll and cublas64_*.dll but
   not cublasLt64_*.dll. The upstream cudart-llama-bin-win-cuda-X.Y-x64
   bundle ships all three (verified against b9103 cuda-12.4 and
   cuda-13.1: 3 DLLs, no executables), and a Windows install missing
   any one of them still fails CUDA initialisation. Adding
   cublasLt64_*.dll to runtime_payload_health_groups so a partial
   install or a deletion of the third DLL triggers reinstall instead
   of silently staying broken.

2. The runtime overlay copy used the same broad runtime_patterns_for_choice
   set as the main archive (windows-cuda returns *.exe and *.dll). A
   malformed runtime zip that contained a llama-server.exe alongside
   the real cudart DLLs would have overwritten the main archive's
   server binary. Introduced paired_runtime_dll_patterns() that
   returns the cudart bundle's three specific filename patterns and
   nothing else, and use that for the second copy_globs pass.
   New end-to-end regression test packs a fake runtime zip with an
   extra llama-server.exe and asserts the main binary survives.

3. (7/12) python_runtime_dirs in install_llama_prebuilt.py and
   _windows_pip_nvidia_dll_dirs in llama_cpp.py walked different path
   sets. The installer side missed nvidia/<pkg>/Library/bin (conda
   layout) and nvidia/<pkg>/bin/x86_64 (current CUDA 13 unsuffixed
   wheel layout), so preflight CUDA detection could fail even when
   usable DLLs were present. Mirrored the same six-path set the
   backend resolver uses, including arch subdirs.

New tests:
 - test_paired_runtime_dll_patterns_excludes_executables
 - test_runtime_overlay_cannot_overwrite_main_archive_payload (end-to-end)
 - test_python_runtime_dirs_covers_cu13_and_library_bin
 - extended test_existing_install_matches_plan_windows_cuda_paired_requires_cudart
   with a cublasLt-missing case

Upstream cudart bundle contents verified empirically by downloading
the b9103 release artifacts directly: each cuda-X.Y bundle contains
exactly cudart64_X.dll + cublas64_X.dll + cublasLt64_X.dll, no exes.

Refs #5106
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed 8b1288f7 to address the three real findings from a 12-reviewer batch over 526894a4:

  1. cublasLt64_*.dll missing from health check (12/12 reviewers agreed). The upstream cudart-llama-bin-win-cuda-X.Y-x64.zip bundle ships exactly three DLLs -- verified by downloading b9103's cuda-12.4 and cuda-13.1 artifacts directly: cudart64_X.dll + cublas64_X.dll + cublasLt64_X.dll, no executables. A user missing only cublasLt64_*.dll (partial deletion, antivirus quarantine, etc) still fails CUDA initialisation; the health check now requires all three when choice.runtime_name is set.

  2. Runtime overlay overly broad. The second copy_globs pass used runtime_patterns_for_choice(choice) which for windows-cuda is ["*.exe", "*.dll"]. A hypothetical malformed cudart zip containing llama-server.exe alongside the cudart DLLs would overwrite the main archive's server binary. Replaced with a new paired_runtime_dll_patterns(choice) helper that returns exactly the cudart bundle's three filename patterns. End-to-end test packs a runtime zip with a fake llama-server.exe and asserts the main binary survives.

  3. Installer / backend resolver asymmetry (7/12 + a related cu13 finding). python_runtime_dirs in install_llama_prebuilt.py walked nvidia/*/lib, nvidia/*/bin, torch/lib, while the backend's _windows_pip_nvidia_dll_dirs (from Studio: add torch's pip nvidia DLL dirs to PATH on Windows #5324) walks the broader Windows wheel layouts. Mirrored the six-path set so installer preflight CUDA detection now also covers:

    • nvidia/<pkg>/bin/x86_64 and nvidia/<pkg>/bin/x64 (current CUDA 13 layout used by the unsuffixed nvidia-cuda-runtime 13.x and nvidia-cublas 13.x packages; verified empirically by pip download nvidia-cuda-runtime --platform win_amd64 which produces nvidia/cu13/bin/x86_64/cudart64_13.dll).
    • nvidia/<pkg>/Library/bin (conda-style repacks) and its arch subdirs.

The lower-priority reviewer findings (P1 version-agnostic glob, P3 contradictory log line) were assessed as low-probability / cosmetic; deferred for a follow-up PR if you want them.

New tests:

  • test_paired_runtime_dll_patterns_excludes_executables -- gate on the helper.
  • test_runtime_overlay_cannot_overwrite_main_archive_payload -- end-to-end on install_from_archives with a malformed runtime zip.
  • test_python_runtime_dirs_covers_cu13_and_library_bin -- installer-side parity with the backend resolver.
  • Extended test_existing_install_matches_plan_windows_cuda_paired_requires_cudart with the cublasLt64_*.dll-missing case.

Regression: 263 install-suite tests pass (was 260 before this commit). Behavioural simulation harness in temp/pr_simulation/ exercises 35 scenarios (combined fresh-install, existing-install, fingerprint, health, PATH assembly, ZIP traversal, install-lock, resolver edge cases, cu13 layout, glob-meta safety, narrow-overlay regression) and all 35 pass.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7447ad558b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +5102 to +5103
"runtime_asset": choice.runtime_name,
"runtime_sha256": choice.runtime_sha256,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit runtime fingerprint fields to paired CUDA installs

expected_install_fingerprint now always includes runtime_asset and runtime_sha256, even when a choice has no paired runtime archive. Because metadata written before this commit used a fingerprint payload without those keys, every pre-existing install (including Linux/macOS and unpaired Windows bundles) will fail recorded_fingerprint != expected_fingerprint and be treated as stale, forcing unnecessary reinstall/validation work and avoidable network dependence. This cache invalidation should be scoped to paired Windows CUDA choices only.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen merged commit e346193 into main May 11, 2026
33 of 34 checks passed
@danielhanchen danielhanchen deleted the dh/fix-5106-windows-cudart-pair branch May 11, 2026 12:42
danielhanchen added a commit that referenced this pull request May 13, 2026
* Studio: pin GPU at 95% headroom and warn on silent CPU fallback

Two related runtime-side fixes for #5106 ("model
loaded fully on RAM instead of VRAM"):

1. GPU pin threshold bump 0.90 -> 0.95
-------------------------------------

``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server``
used a ``pool * 0.90`` threshold to decide whether the model fits on
GPU. Models that needed 91-94% of free VRAM were classified as "does
not fit", so Studio set ``gpu_indices = None`` and shipped
``--fit on`` to llama-server without ``-ngl``. The unsloth
llama.cpp fork's ``--fit on`` then ran with its default
``--fit-target 1024`` (1 GiB margin per device, an upstream default
inherited from ggml-org#18679). On a tight fit where compute
buffers + CUDA context push the projected free below the 1 GiB
target, the fork's fit logic shaves layer weights off the GPU --
slow inference for users whose models would have loaded comfortably
with ``-ngl -1``.

The classic reproducer from #5106 (noahterbest's log):

    GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096,
    GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model
fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit
mode. Bumping to 0.95 keeps these in the fits-on-GPU branch and
emits ``-ngl -1`` directly. The fork's ``--fit on`` still serves as
the safety net for the genuinely-too-large case.

The auto-ctx fallback also re-checks fit at 4096 before handing off
to ``--fit on``: a 20.8 GiB model with a 131072 native context fails
the auto loop at native ctx, falls back to ``min(4096, ctx)``, but
its weights + 4096 KV pin to the GPU comfortably. Without the
re-check we still emitted ``--fit on``.

``_fit_context_to_vram``'s 0.90 budget for context binary search is
intentionally left tighter than the pin fraction. That routine
chooses the slider value, where over-promising would OOM at runtime.
``_select_gpus`` decides whether to pin at all, where being
conservative pushes layers to CPU.

2. Belt-and-suspenders: warn on silent CPU fallback
---------------------------------------------------

After ``_wait_for_health`` succeeds, scan llama-server's stdout for
``model buffer size`` lines. If Studio detected GPUs and intended
GPU use but only CPU buffers were allocated, log a structured
warning citing #5106. Markers cover CUDA / ROCm / Metal / Vulkan /
OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]``
field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix cannot
cover (cudart bundle pairing PR #5322 is the install-side
companion): user overriding ``--fit-target``, uncommon driver +
toolkit configurations, future regressions in the install path.

Tests: 10 new cases in studio/backend/tests/test_llama_cpp_context_fit.py:
* TestTightFitPinsToGPU x3: noahterbest's exact reproducer (auto and
  explicit ctx pins to GPU at 94%); guard against threshold over-
  broadening (genuine overflow still falls back to ``--fit on``).
* TestClassifyGpuOffload x7: CUDA / ROCm / Metal buffer markers
  return True; CPU-only buffer lines return False; absent buffer
  lines or no GPUs detected return None (no warning).

25 context-fit tests pass (15 baseline + 10 new). 511 tests total
across the affected test files. No regressions.

Refs #5106

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Trim comments to be more succinct

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
danielhanchen added a commit that referenced this pull request May 18, 2026
…tic CI test (#5376)

* tests/studio: end-to-end Windows GPU detection mock test (#5106)

Locks in the combined fix from #5322 + #5324 with a synthetic
Windows scenario that CI runners without GPUs can execute. The
test packs the real PyPI win_amd64 wheel layouts (cu12 modular and
the new unsuffixed cu13 nvidia/cu13/bin/x86_64 layout) plus the
exact filename set of the upstream b9103 cudart-llama-bin-win-cuda
bundles, then mocks nvidia-smi output and asserts that:

 * Studio's nvidia-smi probe parses the CSV and reports the GPU.
 * After PR #5322 the install_dir/build/bin/Release/ tree contains
   all three cudart bundle DLLs alongside llama-server.exe.
 * After PR #5324 the PATH built by start_llama_server's win32
   branch lists pip nvidia + torch/lib dirs in addition to the
   binary_dir.
 * cudart64_X.dll, cublas64_X.dll, and cublasLt64_X.dll are
   each reachable from at least one PATH entry, with cudart
   specifically reachable from BOTH the install dir and a pip
   nvidia dir (defence in depth).
 * Bare venvs without pip nvidia wheels still work via #5322's
   binary_dir drop; pre-#5322 installs still work via #5324's
   PATH augmentation.
 * A reconstructed pre-PR scenario (cudart absent from binary_dir
   and pip dirs not on PATH) leaves cudart unreachable, confirming
   the test would catch a future regression.

Bonus housekeeping in studio/install_llama_prebuilt.py: drop the
pointless f-prefix on the literal "llama-" in the
windows_cuda_attempts pairing guard (no behaviour change; lint
nit flagged in the post-merge review).

The mocks model real artifact contents I verified empirically:
 * pip download nvidia-cuda-runtime --platform win_amd64
   produces nvidia/cu13/bin/x86_64/cudart64_13.dll.
 * unzip on the b9103 cudart-llama-bin-win-cuda-13.1-x64.zip
   produces exactly cudart64_13.dll + cublas64_13.dll +
   cublasLt64_13.dll, no executables.
 * objdump -p on the b9103 ggml-cuda.dll shows a static PE
   import on cublas64_13.dll (the root cause of #5106 when
   cublas64_13.dll is unreachable).

Refs #5106 #5322 #5324

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test_5106_windows_gpu_detection_mock: don't shadow real httpx

This file's name sorts before every other file in studio/backend/tests/
(starts with the digit '5'), so pytest collects it first. The previous
``sys.modules.setdefault("httpx", _httpx_stub)`` ran before any other
test imported real httpx, which meant the stub permanently shadowed
the real module for the rest of the collection. Tests that did
``from httpx import HTTPError, Response`` (test_anthropic_messages,
test_browse_folders_route, test_training_*, etc) then failed at
collection with ``ImportError: cannot import name 'HTTPError'``
because the stub did not define those names. The existing
test_llama_cpp_windows_nvidia_path.py did not trigger the same issue
because it sorts after test_a* / test_b* / etc, by which point the
real httpx has already been imported and setdefault is a no-op.

Switch the stub installation to ``importlib.util.find_spec(name) is
None`` so we only fall back to the stub when the real module truly is
not installed. Backend CI installs httpx, structlog, and the
studio/backend/loggers package is reachable via the sys.path
augmentation a few lines above, so on CI all three find_spec calls
succeed and no stubs are installed at all.

Also add HTTPError and Response to the stub module for the offline
case, so anyone running this test outside CI with httpx absent still
gets a stub that satisfies the broader test suite's imports.

Refs #5106

* test_5106 + llama_cpp: extract win32 PATH helper and harden the regression test

Follow-up to PR #5376's review feedback. Three real findings from the
bot reviewers, plus one stale one.

1. (codex P2 line 201, gemini medium line 209) The regression test's
   _build_path_dirs_like_start_llama_server hand-copied the win32
   branch of LlamaCppBackend.start_llama_server, so a future drop or
   reorder of _windows_pip_nvidia_dll_dirs(sys.prefix) in production
   would have passed the test silently.

   Extract a new staticmethod LlamaCppBackend._build_windows_path_dirs
   (binary_dir, prefix, cuda_path). Production start_llama_server now
   calls this helper. The test's wrapper is reduced to a one-line
   delegate that forwards to the staticmethod, so the regression
   asserts against the exact production logic instead of a parallel
   copy of it.

2. (codex P2 line 245) test_nvidia_smi_probe_reports_synthetic_gpu did
   not clear CUDA_VISIBLE_DEVICES. On a shared GPU runner with the
   variable set in the parent shell, _get_gpu_free_memory() filters
   the mocked CSV and returns [] or falls through to the torch
   fallback. Cleared CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
   via monkeypatch.delenv(..., raising=False).

3. (codex P2 line 66) _maybe_stub gated on importlib.util.find_spec
   ("loggers"), which returns a spec because studio/backend/loggers/
   is on sys.path. But the actual import chain loads
   loggers/handlers.py which does `from fastapi import Request,
   Response` at module load. In a lightweight env without fastapi
   installed, the stub never lands and `from core.inference.llama_cpp
   import LlamaCppBackend` raises during collection. Switched
   _maybe_stub to a real import attempt under try / except ImportError
   so the stub falls into place when the package is discoverable but
   not importable. CI has fastapi so this is purely a developer-
   machine ergonomics fix.

The fourth comment (codex P1 line 85 "Keep the httpx stub from leaking
across tests") was already addressed by 7437e73, which replaced the
unconditional sys.modules.setdefault with the find_spec-gated
_maybe_stub. No code change needed.

Production behaviour is unchanged: _build_windows_path_dirs returns
exactly the same ordering start_llama_server used inline
([binary_dir, *pip_dirs, cuda_bin?, cuda_bin_x64?]).

Verification (run inside studio/backend):
  pytest tests/test_5106_windows_gpu_detection_mock.py -v
    -> 10 passed
  pytest tests/test_llama_cpp_*.py tests/test_llama_server_args.py
       tests/test_5106_windows_gpu_detection_mock.py -q
    -> 171 passed
  CUDA_VISIBLE_DEVICES=1 pytest tests/test_5106_windows_gpu_detection_mock.py::TestWindowsGpuDetectionAfter5106Fix::test_nvidia_smi_probe_reports_synthetic_gpu
    -> 1 passed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename Windows GPU detection test to a generic filename and trim comments

- studio/backend/tests/test_5106_windows_gpu_detection_mock.py
  -> studio/backend/tests/test_windows_gpu_detection_mock.py
  The file is the generic regression suite for Windows GPU detection;
  encoding the issue number in the filename is noise.
- Shorten module docstring, helper docstrings, per-test docstrings and
  inline comments in the renamed test file. No behaviour change,
  all 10 cases still pass.
- Shorten the _build_windows_path_dirs docstring in
  studio/backend/core/inference/llama_cpp.py and update the test-path
  reference; trim the win32 call-site comment to one line.

Local verification:
- pytest studio/backend/tests/test_windows_gpu_detection_mock.py -- 10 passed.
- pytest studio/backend/tests/test_llama_cpp_windows_nvidia_path.py
  studio/backend/tests/test_llama_server_args.py
  studio/backend/tests/test_windows_gpu_detection_mock.py -- 110 passed.

* Studio: harden _wait_for_health against transient httpx ReadError

The probe loop in LlamaCppBackend._wait_for_health only caught
ConnectError and TimeoutException. On Windows, when llama-server.exe
accepts the TCP probe and then dies before sending HTTP headers, the
peer process RST closes the socket. httpx maps this to ReadError
("WinError 10054 -- An existing connection was forcibly closed by the
remote host"), which fell through the except clause and bubbled out of
_wait_for_health, the routes/inference.py load_model handler, and back
to /api/inference/load as an opaque 500.

The crash diagnostic Studio actually wants to surface lives on the
self._process.poll() branch at the top of the loop body: "llama-server
exited with code X. Output: ...". We never reached that branch on the
WinError 10054 path because the very first probe blew up.

Expand the except to also swallow ReadError and RemoteProtocolError so
the next 0.5-second iteration runs the poll() branch. Outcomes:
  * Process really died: structured exit-code + last-stdout log line.
  * Single transient probe blip: silently retried; load succeeds.

Adds studio/backend/tests/test_llama_cpp_wait_for_health.py with five
cases covering happy-path 200, transient ReadError + dead process,
RemoteProtocolError + dead process, ConnectError cycling until success,
and dead process before the first probe. The new cases would have
failed against the old except clause -- ReadError / RemoteProtocolError
would have propagated instead of returning False.

Found while triaging the Windows Studio GGUF CI flake on this PR's
5a6ddc3 push: llama-server.exe (b9203 prebuilt) crashed within 2.2 s of
launch on the GPU-less runner, and Studio reported "WinError 10054"
instead of an upstream-tag-attributable exit-code line.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant