Studio: download paired cudart bundle on Windows CUDA installs #5322
Upstream ggml-org/llama.cpp publishes Windows CUDA in two archives that the release notes explicitly say are both required:

    llama-<tag>-bin-win-cuda-X.Y-x64.zip (binaries + ggml DLLs)
    cudart-llama-bin-win-cuda-X.Y-x64.zip (cudart64, cublas64, cublasLt64)

Studio's installer was downloading only the first one. The ``runtime_name`` / ``runtime_url`` fields on AssetChoice existed but were never populated, and ``install_from_archives`` only handled ``choice.url``. With the cudart DLLs missing from ``install_dir/build/bin/Release``, the prebuilt binary's LoadLibrary calls only resolved at runtime when the user happened to have a version-matched system CUDA toolkit on PATH.

That is the underlying cause for the Windows reports in #5106 ("GPU detected but model loaded entirely on RAM"): the prebuilt's CUDA backend silently fails to load and llama-server falls back to CPU regardless of ``-ngl`` or ``--fit on``.

Wires the pairing through end to end:

* ``windows_cuda_attempts`` and ``published_windows_cuda_attempts`` look up the matching ``cudart-llama-bin-win-cuda-X.Y-x64.zip`` asset URL alongside the main archive and store it as ``runtime_url`` / ``runtime_name`` on the AssetChoice. We only pair when the selected main archive is the binary archive (``llama-...zip``) so the legacy cudart-only naming path is unaffected.
* ``apply_approved_hashes`` resolves the runtime archive's hash from the approved manifest. If the manifest does not list the runtime archive, the pairing is dropped rather than installing without checksum coverage. Preserves the supply-chain guarantee for published bundles; upstream installs with no manifest are unaffected (same risk surface as the existing main-archive download).
* ``install_from_archives`` now downloads the runtime archive into a separate temp dir and runs ``copy_globs`` against both source dirs. Separate dirs avoid the "ambiguous archive layout" guard tripping on shared filenames like LICENSE.txt, while the second ``copy_globs`` overlay drops the cudart DLLs into the same ``install_dir/build/bin/Release`` directory as the main binary.

Adds a ``runtime_sha256`` field on AssetChoice to carry the verified hash through to the download step, alongside the existing ``runtime_name`` / ``runtime_url`` slots.

Tests: 5 new cases in tests/studio/install/test_selection_logic.py:

* upstream pairing populates runtime_url / runtime_name
* graceful degrade when cudart asset is absent in the release
* legacy cudart-only naming path does not self-pair
* apply_approved_hashes threads runtime_sha256 when the manifest lists it
* apply_approved_hashes drops the pair when the runtime hash is missing rather than installing without verification

130 install tests pass (125 baseline + 5 new). No regressions.

Refs #5106
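The pairing rule in the first bullet can be illustrated with a minimal sketch. This is not the code in ``windows_cuda_attempts``; the helper name ``paired_cudart_asset``, the regular expression, and the shape of the ``release_assets`` mapping (asset name to download URL) are assumptions made for illustration only.

```python
import re
from typing import Optional

# Assumed naming scheme, taken from the archive names quoted above.
_MAIN_ARCHIVE_RE = re.compile(r"^llama-.+-bin-win-cuda-(\d+\.\d+)-x64\.zip$")


def paired_cudart_asset(
    main_name: str, release_assets: dict[str, str]
) -> Optional[tuple[str, str]]:
    """Return (runtime_name, runtime_url) for a main binary archive, else None.

    Hypothetical helper: only archives whose name starts with "llama-" are
    paired, so a cudart-only selection never pairs with itself.
    """
    match = _MAIN_ARCHIVE_RE.match(main_name)
    if match is None:
        return None
    runtime_name = f"cudart-llama-bin-win-cuda-{match.group(1)}-x64.zip"
    runtime_url = release_assets.get(runtime_name)
    if runtime_url is None:
        # Release has no cudart bundle for this CUDA version: degrade gracefully.
        return None
    return runtime_name, runtime_url
```

Returning ``None`` in both the "not a binary archive" and "asset absent" cases keeps the caller's fallback path identical, which matches the graceful-degrade behaviour the tests describe.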
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3b2aeed578
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    runtime_name: str | None = None
    runtime_url: str | None = None
    runtime_sha256: str | None = None
Include runtime bundle in install cache validation
When a Windows CUDA choice now has a paired runtime archive, an existing install made before this change can still be considered up to date because expected_install_fingerprint()/expected_pairs only key off the main asset and runtime_line, while runtime_payload_health_groups() for windows-cuda only checks llama.dll and ggml-cuda.dll. In that scenario Studio skips the reinstall before install_from_archives() ever downloads cudart-llama...zip, leaving exactly the broken installs without cudart/cublas DLLs unaffected; include the runtime archive/hash in the fingerprint or explicitly require the paired DLLs so affected installs are refreshed.
Code Review
This pull request implements support for paired runtime archives, specifically for Windows CUDA prebuilts, to ensure necessary DLLs like cudart and cublas are included during installation. The changes modify the AssetChoice structure, update asset selection logic to pair binary archives with their corresponding runtime bundles, and enhance the installation process to download and overlay these archives with checksum verification. New tests have been added to validate the pairing logic and hash verification. I have no feedback to provide.
expected_install_fingerprint did not hash the new runtime_name / runtime_sha256 fields, and runtime_payload_health_groups for windows-cuda only checked llama.dll / ggml-cuda.dll. The combination meant that an install made before this PR -- the exact installs reporting #5106 -- would still match the post-PR choice: same main asset name + sha, same llama.dll, same ggml-cuda.dll, missing cudart64_*.dll, but existing_install_matches_choice returned True and the cudart download path in install_from_archives never ran. Fresh installs got the fix; existing affected installs did not.

This commit:

* Adds runtime_asset and runtime_sha256 to the fingerprint payload so any change to (or first introduction of) the cudart pair invalidates pre-existing installs.
* Refactors write_prebuilt_metadata to call expected_install_fingerprint so the recorded fingerprint cannot drift from the expected one when new keys are added.
* Extends runtime_payload_health_groups for windows-cuda to require cudart64_*.dll and cublas64_*.dll *only when the choice carries a paired runtime archive*. Gating on choice.runtime_name keeps the no-pair fallback path (manifest missing cudart hash, upstream without paired bundle) from looping on reinstall.

New tests:

* test_existing_install_matches_plan_windows_cuda_paired_requires_cudart -- paired choice rejects installs missing cudart / cublas.
* test_existing_install_matches_plan_windows_cuda_unpaired_skips_cudart_check -- unpaired choice still accepts legacy cudart-less installs.
* test_existing_install_fingerprint_changes_when_cudart_pair_added -- direct fingerprint mismatch between the legacy and paired choice.

Refs #5106
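The gated health check described in the commit message above might look roughly like the following. It is a sketch, not the body of runtime_payload_health_groups: the function names, the reading of a "group" as a list of alternative patterns of which one must match, and the use of ``fnmatch`` are all assumptions.

```python
from fnmatch import fnmatch
from typing import Iterable, Optional


def health_groups(runtime_name: Optional[str]) -> list[list[str]]:
    """Glob groups that must each be satisfied by at least one installed file."""
    groups = [["llama.dll"], ["ggml-cuda.dll"]]
    if runtime_name:
        # Only a paired choice requires the runtime DLLs; an unpaired choice
        # must keep accepting legacy installs or it would reinstall forever.
        groups.append(["cudart64_*.dll"])
        groups.append(["cublas64_*.dll"])
    return groups


def install_is_healthy(filenames: Iterable[str], runtime_name: Optional[str]) -> bool:
    """True when every group has at least one matching file name."""
    names = list(filenames)
    return all(
        any(fnmatch(name, pattern) for pattern in group for name in names)
        for group in health_groups(runtime_name)
    )
```

With this shape, the same legacy file set is healthy for an unpaired choice and unhealthy for a paired one, which is the distinction the two ``existing_install_matches_plan`` tests exercise.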
Pushed The mechanism in the PR was correct end-to-end -- The remaining hole was that an existing pre-PR install would still match the post-PR choice and skip the reinstall that drops the cudart DLLs in:
Combined, What the commit changes:
New tests in
Full suite stays green: Minor nit (not addressed): |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90431b9982
    groups.append(["cudart64_*.dll"])
    groups.append(["cublas64_*.dll"])
Require cublasLt DLL for paired Windows CUDA installs
When runtime_name is set, the health check only requires cudart64_*.dll and cublas64_*.dll, but not cublasLt64_*.dll even though the paired cudart bundle includes it. A partially deleted/corrupted install that is missing only cublasLt will still be treated as up-to-date and skip reinstall, which can leave CUDA backend initialization broken at runtime despite this commit’s paired-runtime validation path.
Three follow-ups from a 12-reviewer batch over 526894a (PR #5322):

1. (12/12) Health check required cudart64_*.dll and cublas64_*.dll but not cublasLt64_*.dll. The upstream cudart-llama-bin-win-cuda-X.Y-x64 bundle ships all three (verified against b9103 cuda-12.4 and cuda-13.1: 3 DLLs, no executables), and a Windows install missing any one of them still fails CUDA initialisation. Added cublasLt64_*.dll to runtime_payload_health_groups so that a partial install or a deletion of the third DLL triggers reinstall instead of silently staying broken.

2. The runtime overlay copy used the same broad runtime_patterns_for_choice set as the main archive (windows-cuda returns *.exe and *.dll). A malformed runtime zip that contained a llama-server.exe alongside the real cudart DLLs would have overwritten the main archive's server binary. Introduced paired_runtime_dll_patterns(), which returns the cudart bundle's three specific filename patterns and nothing else, and used it for the second copy_globs pass. New end-to-end regression test packs a fake runtime zip with an extra llama-server.exe and asserts the main binary survives.

3. (7/12) python_runtime_dirs in install_llama_prebuilt.py and _windows_pip_nvidia_dll_dirs in llama_cpp.py walked different path sets. The installer side missed nvidia/<pkg>/Library/bin (conda layout) and nvidia/<pkg>/bin/x86_64 (current CUDA 13 unsuffixed wheel layout), so preflight CUDA detection could fail even when usable DLLs were present. Mirrored the same six-path set the backend resolver uses, including arch subdirs.

New tests:

- test_paired_runtime_dll_patterns_excludes_executables
- test_runtime_overlay_cannot_overwrite_main_archive_payload (end-to-end)
- test_python_runtime_dirs_covers_cu13_and_library_bin
- extended test_existing_install_matches_plan_windows_cuda_paired_requires_cudart with a cublasLt-missing case

Upstream cudart bundle contents verified empirically by downloading the b9103 release artifacts directly: each cuda-X.Y bundle contains exactly cudart64_X.dll + cublas64_X.dll + cublasLt64_X.dll, no exes.

Refs #5106
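Follow-up 2 is easiest to see with a small example. The three patterns below are the ones the commit message names; the ``select_overlay_files`` helper and the use of ``fnmatch`` are hypothetical stand-ins for whatever the second copy_globs pass does internally.

```python
from fnmatch import fnmatch
from typing import Iterable


def paired_runtime_dll_patterns() -> list[str]:
    """Patterns for the three DLLs the upstream cudart bundle is said to ship."""
    return ["cudart64_*.dll", "cublas64_*.dll", "cublasLt64_*.dll"]


def select_overlay_files(members: Iterable[str], patterns: list[str]) -> list[str]:
    """Keep only archive members that match one of the narrow patterns.

    Hypothetical filter: anything else in the runtime zip (executables,
    licence files) is ignored, so it cannot overwrite main-archive files.
    """
    return [m for m in members if any(fnmatch(m, p) for p in patterns)]
```

Restricting the overlay to an allow-list, rather than excluding ``*.exe``, means an unexpected file type in a malformed runtime zip is dropped by default.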
Pushed
The lower-priority reviewer findings (P1 version-agnostic glob, P3 contradictory log line) were assessed as low-probability / cosmetic; deferred for a follow-up PR if you want them. New tests:
Regression: 263 install-suite tests pass (was 260 before this commit). Behavioural simulation harness in |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7447ad558b
    "runtime_asset": choice.runtime_name,
    "runtime_sha256": choice.runtime_sha256,
Limit runtime fingerprint fields to paired CUDA installs
expected_install_fingerprint now always includes runtime_asset and runtime_sha256, even when a choice has no paired runtime archive. Because metadata written before this commit used a fingerprint payload without those keys, every pre-existing install (including Linux/macOS and unpaired Windows bundles) will fail recorded_fingerprint != expected_fingerprint and be treated as stale, forcing unnecessary reinstall/validation work and avoidable network dependence. This cache invalidation should be scoped to paired Windows CUDA choices only.
* Studio: pin GPU at 95% headroom and warn on silent CPU fallback

Two related runtime-side fixes for #5106 ("model loaded fully on RAM instead of VRAM"):

1. GPU pin threshold bump 0.90 -> 0.95
-------------------------------------

``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server`` used a ``pool * 0.90`` threshold to decide whether the model fits on GPU. Models that needed 91-94% of free VRAM were classified as "does not fit", so Studio set ``gpu_indices = None`` and shipped ``--fit on`` to llama-server without ``-ngl``.

The unsloth llama.cpp fork's ``--fit on`` then ran with its default ``--fit-target 1024`` (1 GiB margin per device, an upstream default inherited from ggml-org#18679). On a tight fit where compute buffers + CUDA context push the projected free below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU -- slow inference for users whose models would have loaded comfortably with ``-ngl -1``.

The classic reproducer from #5106 (noahterbest's log):

    GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096, GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. Bumping to 0.95 keeps these in the fits-on-GPU branch and emits ``-ngl -1`` directly. The fork's ``--fit on`` still serves as the safety net for the genuinely-too-large case.

The auto-ctx fallback also re-checks fit at 4096 before handing off to ``--fit on``: a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx, falls back to ``min(4096, ctx)``, but its weights + 4096 KV pin to the GPU comfortably. Without the re-check we still emitted ``--fit on``.

``_fit_context_to_vram``'s 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. ``_select_gpus`` decides whether to pin at all, where being conservative pushes layers to CPU.

2. Belt-and-suspenders: warn on silent CPU fallback
---------------------------------------------------

After ``_wait_for_health`` succeeds, scan llama-server's stdout for ``model buffer size`` lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning citing #5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]`` field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix cannot cover (cudart bundle pairing PR #5322 is the install-side companion): user overriding ``--fit-target``, uncommon driver + toolkit configurations, future regressions in the install path.

Tests: 10 new cases in studio/backend/tests/test_llama_cpp_context_fit.py:

* TestTightFitPinsToGPU x3: noahterbest's exact reproducer (auto and explicit ctx pins to GPU at 94%); guard against threshold over-broadening (genuine overflow still falls back to ``--fit on``).
* TestClassifyGpuOffload x7: CUDA / ROCm / Metal buffer markers return True; CPU-only buffer lines return False; absent buffer lines or no GPUs detected return None (no warning).

25 context-fit tests pass (15 baseline + 10 new). 511 tests total across the affected test files. No regressions.

Refs #5106

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Trim comments to be more succinct

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
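The threshold arithmetic for the reproducer above can be checked with a few lines. This is a sketch under assumptions: ``fits_on_gpu`` is a hypothetical name, the real ``_select_gpus`` may account for more than weights plus KV cache, and the log's "GB" figures are treated as GiB, as the commit message itself does.

```python
MIB_PER_GIB = 1024


def fits_on_gpu(required_mib: float, free_mib: float, pin_fraction: float) -> bool:
    """True when the requirement stays within pin_fraction of free VRAM."""
    return required_mib <= free_mib * pin_fraction


# Figures from the quoted log line.
model_mib = 20.8 * MIB_PER_GIB   # GGUF size
kv_mib = 0.1 * MIB_PER_GIB       # estimated KV cache at context 4096
free_mib = 22805                 # free VRAM reported for GPU 0
required_mib = model_mib + kv_mib

old_decision = fits_on_gpu(required_mib, free_mib, 0.90)  # pre-change threshold
new_decision = fits_on_gpu(required_mib, free_mib, 0.95)  # post-change threshold
```

With these inputs the requirement is about 21402 MiB against a budget of about 20525 MiB at 0.90 and about 21665 MiB at 0.95, so only the new threshold keeps the model in the fits-on-GPU branch.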
…tic CI test (#5376)

* tests/studio: end-to-end Windows GPU detection mock test (#5106)

Locks in the combined fix from #5322 + #5324 with a synthetic Windows scenario that CI runners without GPUs can execute. The test packs the real PyPI win_amd64 wheel layouts (cu12 modular and the new unsuffixed cu13 nvidia/cu13/bin/x86_64 layout) plus the exact filename set of the upstream b9103 cudart-llama-bin-win-cuda bundles, then mocks nvidia-smi output and asserts that:

* Studio's nvidia-smi probe parses the CSV and reports the GPU.
* After PR #5322 the install_dir/build/bin/Release/ tree contains all three cudart bundle DLLs alongside llama-server.exe.
* After PR #5324 the PATH built by start_llama_server's win32 branch lists pip nvidia + torch/lib dirs in addition to the binary_dir.
* cudart64_X.dll, cublas64_X.dll, and cublasLt64_X.dll are each reachable from at least one PATH entry, with cudart specifically reachable from BOTH the install dir and a pip nvidia dir (defence in depth).
* Bare venvs without pip nvidia wheels still work via #5322's binary_dir drop; pre-#5322 installs still work via #5324's PATH augmentation.
* A reconstructed pre-PR scenario (cudart absent from binary_dir and pip dirs not on PATH) leaves cudart unreachable, confirming the test would catch a future regression.

Bonus housekeeping in studio/install_llama_prebuilt.py: drop the pointless f-prefix on the literal "llama-" in the windows_cuda_attempts pairing guard (no behaviour change; lint nit flagged in the post-merge review).

The mocks model real artifact contents I verified empirically:

* pip download nvidia-cuda-runtime --platform win_amd64 produces nvidia/cu13/bin/x86_64/cudart64_13.dll.
* unzip on the b9103 cudart-llama-bin-win-cuda-13.1-x64.zip produces exactly cudart64_13.dll + cublas64_13.dll + cublasLt64_13.dll, no executables.
* objdump -p on the b9103 ggml-cuda.dll shows a static PE import on cublas64_13.dll (the root cause of #5106 when cublas64_13.dll is unreachable).

Refs #5106 #5322 #5324

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test_5106_windows_gpu_detection_mock: don't shadow real httpx

This file's name sorts before every other file in studio/backend/tests/ (starts with the digit '5'), so pytest collects it first. The previous ``sys.modules.setdefault("httpx", _httpx_stub)`` ran before any other test imported real httpx, which meant the stub permanently shadowed the real module for the rest of the collection. Tests that did ``from httpx import HTTPError, Response`` (test_anthropic_messages, test_browse_folders_route, test_training_*, etc) then failed at collection with ``ImportError: cannot import name 'HTTPError'`` because the stub did not define those names.

The existing test_llama_cpp_windows_nvidia_path.py did not trigger the same issue because it sorts after test_a* / test_b* / etc, by which point the real httpx has already been imported and setdefault is a no-op.

Switch the stub installation to ``importlib.util.find_spec(name) is None`` so we only fall back to the stub when the real module truly is not installed. Backend CI installs httpx, structlog, and the studio/backend/loggers package is reachable via the sys.path augmentation a few lines above, so on CI all three find_spec calls succeed and no stubs are installed at all.

Also add HTTPError and Response to the stub module for the offline case, so anyone running this test outside CI with httpx absent still gets a stub that satisfies the broader test suite's imports.

Refs #5106

* test_5106 + llama_cpp: extract win32 PATH helper and harden the regression test

Follow-up to PR #5376's review feedback. Three real findings from the bot reviewers, plus one stale one.

1. (codex P2 line 201, gemini medium line 209) The regression test's _build_path_dirs_like_start_llama_server hand-copied the win32 branch of LlamaCppBackend.start_llama_server, so a future drop or reorder of _windows_pip_nvidia_dll_dirs(sys.prefix) in production would have passed the test silently. Extract a new staticmethod LlamaCppBackend._build_windows_path_dirs (binary_dir, prefix, cuda_path). Production start_llama_server now calls this helper. The test's wrapper is reduced to a one-line delegate that forwards to the staticmethod, so the regression asserts against the exact production logic instead of a parallel copy of it.

2. (codex P2 line 245) test_nvidia_smi_probe_reports_synthetic_gpu did not clear CUDA_VISIBLE_DEVICES. On a shared GPU runner with the variable set in the parent shell, _get_gpu_free_memory() filters the mocked CSV and returns [] or falls through to the torch fallback. Cleared CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES via monkeypatch.delenv(..., raising=False).

3. (codex P2 line 66) _maybe_stub gated on importlib.util.find_spec ("loggers"), which returns a spec because studio/backend/loggers/ is on sys.path. But the actual import chain loads loggers/handlers.py which does `from fastapi import Request, Response` at module load. In a lightweight env without fastapi installed, the stub never lands and `from core.inference.llama_cpp import LlamaCppBackend` raises during collection. Switched _maybe_stub to a real import attempt under try / except ImportError so the stub falls into place when the package is discoverable but not importable. CI has fastapi so this is purely a developer-machine ergonomics fix.

The fourth comment (codex P1 line 85 "Keep the httpx stub from leaking across tests") was already addressed by 7437e73, which replaced the unconditional sys.modules.setdefault with the find_spec-gated _maybe_stub. No code change needed.

Production behaviour is unchanged: _build_windows_path_dirs returns exactly the same ordering start_llama_server used inline ([binary_dir, *pip_dirs, cuda_bin?, cuda_bin_x64?]).

Verification (run inside studio/backend):

    pytest tests/test_5106_windows_gpu_detection_mock.py -v -> 10 passed
    pytest tests/test_llama_cpp_*.py tests/test_llama_server_args.py tests/test_5106_windows_gpu_detection_mock.py -q -> 171 passed
    CUDA_VISIBLE_DEVICES=1 pytest tests/test_5106_windows_gpu_detection_mock.py::TestWindowsGpuDetectionAfter5106Fix::test_nvidia_smi_probe_reports_synthetic_gpu -> 1 passed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename Windows GPU detection test to a generic filename and trim comments

- studio/backend/tests/test_5106_windows_gpu_detection_mock.py -> studio/backend/tests/test_windows_gpu_detection_mock.py. The file is the generic regression suite for Windows GPU detection; encoding the issue number in the filename is noise.
- Shorten module docstring, helper docstrings, per-test docstrings and inline comments in the renamed test file. No behaviour change, all 10 cases still pass.
- Shorten the _build_windows_path_dirs docstring in studio/backend/core/inference/llama_cpp.py and update the test-path reference; trim the win32 call-site comment to one line.

Local verification:

- pytest studio/backend/tests/test_windows_gpu_detection_mock.py -- 10 passed.
- pytest studio/backend/tests/test_llama_cpp_windows_nvidia_path.py studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_windows_gpu_detection_mock.py -- 110 passed.

* Studio: harden _wait_for_health against transient httpx ReadError

The probe loop in LlamaCppBackend._wait_for_health only caught ConnectError and TimeoutException. On Windows, when llama-server.exe accepts the TCP probe and then dies before sending HTTP headers, the peer process RST closes the socket. httpx maps this to ReadError ("WinError 10054 -- An existing connection was forcibly closed by the remote host"), which fell through the except clause and bubbled out of _wait_for_health, the routes/inference.py load_model handler, and back to /api/inference/load as an opaque 500.

The crash diagnostic Studio actually wants to surface lives on the self._process.poll() branch at the top of the loop body: "llama-server exited with code X. Output: ...". We never reached that branch on the WinError 10054 path because the very first probe blew up.

Expand the except to also swallow ReadError and RemoteProtocolError so the next 0.5-second iteration runs the poll() branch. Outcomes:

* Process really died: structured exit-code + last-stdout log line.
* Single transient probe blip: silently retried; load succeeds.

Adds studio/backend/tests/test_llama_cpp_wait_for_health.py with five cases covering happy-path 200, transient ReadError + dead process, RemoteProtocolError + dead process, ConnectError cycling until success, and dead process before the first probe. The new cases would have failed against the old except clause -- ReadError / RemoteProtocolError would have propagated instead of returning False.

Found while triaging the Windows Studio GGUF CI flake on this PR's 5a6ddc3 push: llama-server.exe (b9203 prebuilt) crashed within 2.2 s of launch on the GPU-less runner, and Studio reported "WinError 10054" instead of an upstream-tag-attributable exit-code line.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
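The control flow of the hardened health loop in the last commit can be sketched without httpx. Everything here is a stand-in: ``TransientProbeError`` plays the role of the four httpx exceptions named above (ConnectError, TimeoutException, ReadError, RemoteProtocolError), and ``probe`` / ``poll_exit_code`` are injected callables rather than Studio's real HTTP client and subprocess handle.

```python
import time
from typing import Callable, Optional


class TransientProbeError(Exception):
    """Stand-in for the transport errors the real loop is said to swallow."""


def wait_for_health(
    probe: Callable[[], int],
    poll_exit_code: Callable[[], Optional[int]],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once probe() reports HTTP 200, False on exit or timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Check the process first so a crash yields the exit-code diagnostic
        # instead of whatever transport error the dying socket produced.
        if poll_exit_code() is not None:
            return False
        try:
            if probe() == 200:
                return True
        except TransientProbeError:
            # Swallow and retry; the next iteration re-runs the exit check.
            pass
        sleep(interval_s)
    return False
```

Because the exception is swallowed rather than re-raised, a probe that fails while the process is dying leads to the exit-code branch on the next iteration, and a one-off blip is simply retried.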
Summary
Fixes the Windows half of #5106 (multiple users reporting "GPU detected but model loads entirely on RAM/CPU"). Studio's installer was downloading only the main llama-<tag>-bin-win-cuda-X.Y-x64.zip archive, but upstream ggml-org/llama.cpp ships Windows CUDA in two archives that the release notes explicitly require both of:

- llama-<tag>-bin-win-cuda-X.Y-x64.zip (binaries + ggml DLLs)
- cudart-llama-bin-win-cuda-X.Y-x64.zip (cudart64_X.dll, cublas64_X.dll, cublasLt64_X.dll)

The AssetChoice.runtime_name / runtime_url fields existed but were never populated, and install_from_archives only handled choice.url. With cudart DLLs missing from install_dir/build/bin/Release, the prebuilt's LoadLibrary calls only resolved when the user happened to have a version-matched system CUDA toolkit on PATH. That is why nvidia-smi reported the GPU and Studio's nvidia-smi probe reported free VRAM, but llama-server's CUDA backend silently failed to initialize and the model loaded entirely on CPU regardless of -ngl or --fit on.

What changed

- windows_cuda_attempts and published_windows_cuda_attempts now look up the paired cudart-llama-bin-win-cuda-X.Y-x64.zip URL alongside the main archive and store it as runtime_url / runtime_name on the AssetChoice. Pairing only happens when the selected main archive is the binary archive (name starts with llama-); the legacy cudart-only naming path stays intact.
- apply_approved_hashes resolves the runtime archive's hash from the approved manifest. If the manifest does not list the runtime archive, the pairing is dropped rather than installing an unverified runtime. Published bundles keep their supply-chain guarantee; upstream installs without a manifest sit on the same risk surface as the existing main-archive download.
- install_from_archives downloads the runtime archive into its own temp dir and runs copy_globs against both source dirs. Separate dirs avoid the "ambiguous archive layout" guard tripping on shared filenames like LICENSE.txt, while the second copy_globs overlay drops the cudart DLLs alongside llama-server.exe in install_dir/build/bin/Release.
- runtime_sha256 field on AssetChoice to carry the verified hash through to the download step.

Test plan

- python -m pytest tests/studio/install/test_selection_logic.py (130 passed: 125 baseline + 5 new)
- python -m pytest tests/studio/install/test_install_llama_prebuilt_logic.py tests/studio/install/test_pr4562_bugfixes.py tests/studio/install/test_llama_pr_force_and_source.py (all green)
- nvidia-smi shows VRAM usage during inference
- winget install Nvidia.CUDA

New tests added in tests/studio/install/test_selection_logic.py:

- test_cudart_runtime_archive_is_paired -- upstream attempts populate runtime_url
- test_no_runtime_archive_when_cudart_absent -- graceful degrade when cudart asset is missing in the release
- test_cudart_only_assets_do_not_self_pair -- legacy cudart-only naming path stays unchanged
- test_runtime_hash_threaded_when_present -- apply_approved_hashes resolves runtime_sha256 from the manifest
- test_runtime_pair_dropped_when_hash_missing -- pairing is dropped when the manifest lacks the runtime hash

Refs #5106
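The hash-threading rule covered by the last two tests can be sketched as follows. This is an illustration, not apply_approved_hashes itself: ``Choice`` is a reduced stand-in for AssetChoice, ``thread_runtime_hash`` is a hypothetical name, the manifest is modelled as a plain name-to-hash mapping, and handling of the main archive's hash is left out.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Choice:
    """Reduced, hypothetical stand-in for AssetChoice."""
    name: str
    url: str
    runtime_name: Optional[str] = None
    runtime_url: Optional[str] = None
    runtime_sha256: Optional[str] = None


def thread_runtime_hash(choice: Choice, approved: dict[str, str]) -> Choice:
    """Attach the runtime hash, or drop the pairing when it is not approved."""
    if choice.runtime_name is None:
        return choice
    runtime_hash = approved.get(choice.runtime_name)
    if runtime_hash is None:
        # No checksum coverage: fall back to a main-archive-only install
        # rather than downloading an unverified runtime bundle.
        return replace(choice, runtime_name=None, runtime_url=None, runtime_sha256=None)
    return replace(choice, runtime_sha256=runtime_hash)
```

Dropping all three runtime fields together keeps the choice internally consistent, so later steps can treat "runtime_name is set" as meaning "a verified runtime archive is available".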