Studio: add torch's pip nvidia DLL dirs to PATH on Windows#5324
Conversation
Studio's install_python_stack bundles torch with matching CUDA wheels (nvidia-cuda-runtime-cu13, nvidia-cublas-cu13, etc.) which ship cudart64_X.dll, cublas64_X.dll, and cublasLt64_X.dll under the prefix's Lib/site-packages/nvidia/<pkg>/(bin|Library/bin)/ tree. The Linux runtime env block in start_llama_server already pulls the equivalent nvidia/cu*/lib paths into LD_LIBRARY_PATH, but the Windows block did not do this, so the prebuilt llama-server.exe could not resolve cudart64_X.dll at runtime unless the user had a matching system CUDA toolkit on PATH. That is the root cause of the Windows reports in #5106 ("GPU detected but model loaded entirely on RAM/CPU"), and matches Roland's repeated workaround in that issue: install matching CUDA toolkit version. Brings the Windows env block in line with the Linux pattern: * New LlamaCppBackend._windows_pip_nvidia_dll_dirs resolver globs <prefix>/Lib/site-packages/nvidia/<pkg>/bin and <prefix>/Lib/site-packages/nvidia/<pkg>/Library/bin. Both layouts are seen in the wild across cuda_runtime / cublas / cudnn / nvjitlink wheels. * The Windows env block now extends path_dirs with the resolver's output before falling back to CUDA_PATH/bin, so pip-installed wheels are the canonical source (mirroring the Linux LD_LIBRARY_PATH ordering). System CUDA toolkit remains a valid fallback. Tests: 7 new cases in studio/backend/tests/test_llama_cpp_windows_nvidia_path.py: * empty resolver when no nvidia wheels installed * nvidia/<pkg>/bin layout resolved * nvidia/<pkg>/Library/bin layout resolved * mixed bin and Library/bin layouts both resolved * unrelated site-packages contents not walked * non-directory entries skipped * missing prefix does not raise 110 backend tests pass. No regressions. Refs #5106
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to locate and include CUDA DLL directories from pip-installed NVIDIA wheels on Windows, ensuring llama-server.exe can function without a system-wide CUDA installation. It adds a new static method _windows_pip_nvidia_dll_dirs to the LlamaCppBackend class and updates the load_model function to incorporate these paths. Additionally, a comprehensive test suite has been added to verify the directory resolution logic. The review feedback suggests refactoring the new method to use pathlib.Path instead of os.path to align with modern path handling practices in the codebase.
| import glob as _glob | ||
|
|
||
| nvidia_root = os.path.join(prefix, "Lib", "site-packages", "nvidia") | ||
| out: list[str] = [] | ||
| for pattern in ( | ||
| os.path.join(nvidia_root, "*", "bin"), | ||
| os.path.join(nvidia_root, "*", "Library", "bin"), | ||
| ): | ||
| for nv_dir in _glob.glob(pattern): | ||
| if os.path.isdir(nv_dir): | ||
| out.append(nv_dir) | ||
| return out |
There was a problem hiding this comment.
Refactoring to pathlib.Path is encouraged to align with the codebase's move toward modern path handling. Please ensure this change is consistent with the module's existing import patterns, particularly regarding the handling of local imports, as per the repository's file-level conventions.
nvidia_root = Path(prefix) / "Lib" / "site-packages" / "nvidia"
if not nvidia_root.is_dir():
return []
return [
str(p)
for pattern in ("*/bin", "*/Library/bin")
for p in nvidia_root.glob(pattern)
if p.is_dir()
]References
- Follow existing file-level conventions for imports, such as keeping certain imports inline if that is the established pattern in the file. (link)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: afb1f7cf59
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| nvidia_root = os.path.join(prefix, "Lib", "site-packages", "nvidia") | ||
| out: list[str] = [] | ||
| for pattern in ( | ||
| os.path.join(nvidia_root, "*", "bin"), | ||
| os.path.join(nvidia_root, "*", "Library", "bin"), |
There was a problem hiding this comment.
Include torch/lib in Windows DLL search
On Windows installs where PyTorch's CUDA wheel bundles the runtime DLLs under Lib/site-packages/torch/lib rather than separate nvidia/*/bin wheels, this resolver returns no pip-provided DLL directories, so llama-server.exe still falls back to requiring a system CUDA toolkit. The installer already treats torch/lib as a Python runtime DLL source (install_llama_prebuilt.py::python_runtime_dirs), and setup.ps1 installs PyTorch CUDA wheels specifically because they bundle CUDA, so the runtime PATH construction should include that same location when the nvidia namespace dirs are absent.
Useful? React with 👍 / 👎.
PyTorch's Windows CUDA wheels frequently bundle cudart64_X.dll and cublas64_X.dll directly under Lib/site-packages/torch/lib/ instead of shipping separate nvidia-cuda-runtime-cuXX / nvidia-cublas-cuXX wheels. On those installs _windows_pip_nvidia_dll_dirs previously returned nothing useful, and llama-server.exe fell back to needing a system CUDA toolkit on PATH -- the original #5106 failure mode. The install-side equivalent python_runtime_dirs in install_llama_prebuilt.py already treats torch/lib as a Python runtime DLL source for the same reason. Bring the runtime resolver in parity so torch-bundled-CUDA installs find their cudart at llama-server start. Updates the existing test that codified the bug (asserted torch/lib was excluded), and adds three new cases: pickup, combined-with-nvidia, and the must-be-a-directory guard.
|
Pushed The mechanism in the PR was correct -- The hole was that PyTorch's Windows CUDA wheels frequently bundle What the commit changes:
Tests in
Full suite stays green: 10 cases here, 146 across Minor open items (not addressed):
|
Three follow-ups from a 12-reviewer batch over c1c8a07 (PR #5324): 1. The current nvidia-cuda-runtime (unsuffixed) 13.2.75 and nvidia-cublas 13.4.0.1 Windows wheels on PyPI ship under nvidia/cu13/bin/x86_64/cudart64_13.dll etc, not under nvidia/PKG/bin/. The previous resolver matched only one directory level past nvidia/PKG/ and silently missed the actual cu13 DLL location, leaving CUDA 13 users on the same failure mode as before #5106. Verified against: pip download nvidia-cuda-runtime --platform win_amd64 which produces nvidia/cu13/bin/x86_64/cudart64_13.dll. 2. glob.glob over sys.prefix interprets [ and ] as a character class. Valid Windows usernames / install paths can contain those characters (for example C:\Users\alice[work]\studio), so the previous resolver silently returned an empty list for such prefixes even when DLL dirs were present. 3. The resolver only ever returned nvidia/PKG/bin -- if both bin and bin/x86_64 exist (current wheels do), Windows DLL search should land on the arch-specific subdir first so the explicit cudart64_X.dll location wins. Rewritten as a pathlib.Path.iterdir walk to fix all three: no glob escaping needed, arch-specific subdirs added explicitly, and ordering puts bin/x86_64 before bin. Conda-style Library/bin/x86_64 and Library/bin/x64 are also covered for parity. A seen set dedupes when wheels happen to expose the same directory through multiple layouts. New tests: - test_picks_up_cu13_bin_x86_64_layout (the actual real-world cu13 case) - test_picks_up_bin_x64_layout - test_mixed_cu12_and_cu13_layouts - test_glob_meta_in_prefix_is_safe (bracket repro) - test_arch_subdir_listed_before_parent_bin (ordering) Verified empirically against PyPI: nvidia-cuda-runtime 13.2.75 -> nvidia/cu13/bin/x86_64/cudart64_13.dll nvidia-cublas 13.4.0.1 -> nvidia/cu13/bin/x86_64/cublas64_13.dll nvidia/cu13/bin/x86_64/cublasLt64_13.dll nvidia-cudnn-cu13 9.22.0.52 -> nvidia/cudnn/bin/cudnn64_9.dll (already covered) Refs #5106
for more information, see https://pre-commit.ci
|
Pushed
Reviewer-flagged but assessed and deferred:
New tests:
Regression: 15 cases in this file pass, 151 across |
…tic CI test (#5376) * tests/studio: end-to-end Windows GPU detection mock test (#5106) Locks in the combined fix from #5322 + #5324 with a synthetic Windows scenario that CI runners without GPUs can execute. The test packs the real PyPI win_amd64 wheel layouts (cu12 modular and the new unsuffixed cu13 nvidia/cu13/bin/x86_64 layout) plus the exact filename set of the upstream b9103 cudart-llama-bin-win-cuda bundles, then mocks nvidia-smi output and asserts that: * Studio's nvidia-smi probe parses the CSV and reports the GPU. * After PR #5322 the install_dir/build/bin/Release/ tree contains all three cudart bundle DLLs alongside llama-server.exe. * After PR #5324 the PATH built by start_llama_server's win32 branch lists pip nvidia + torch/lib dirs in addition to the binary_dir. * cudart64_X.dll, cublas64_X.dll, and cublasLt64_X.dll are each reachable from at least one PATH entry, with cudart specifically reachable from BOTH the install dir and a pip nvidia dir (defence in depth). * Bare venvs without pip nvidia wheels still work via #5322's binary_dir drop; pre-#5322 installs still work via #5324's PATH augmentation. * A reconstructed pre-PR scenario (cudart absent from binary_dir and pip dirs not on PATH) leaves cudart unreachable, confirming the test would catch a future regression. Bonus housekeeping in studio/install_llama_prebuilt.py: drop the pointless f-prefix on the literal "llama-" in the windows_cuda_attempts pairing guard (no behaviour change; lint nit flagged in the post-merge review). The mocks model real artifact contents I verified empirically: * pip download nvidia-cuda-runtime --platform win_amd64 produces nvidia/cu13/bin/x86_64/cudart64_13.dll. * unzip on the b9103 cudart-llama-bin-win-cuda-13.1-x64.zip produces exactly cudart64_13.dll + cublas64_13.dll + cublasLt64_13.dll, no executables. * objdump -p on the b9103 ggml-cuda.dll shows a static PE import on cublas64_13.dll (the root cause of #5106 when cublas64_13.dll is unreachable). Refs #5106 #5322 #5324 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test_5106_windows_gpu_detection_mock: don't shadow real httpx This file's name sorts before every other file in studio/backend/tests/ (starts with the digit '5'), so pytest collects it first. The previous ``sys.modules.setdefault("httpx", _httpx_stub)`` ran before any other test imported real httpx, which meant the stub permanently shadowed the real module for the rest of the collection. Tests that did ``from httpx import HTTPError, Response`` (test_anthropic_messages, test_browse_folders_route, test_training_*, etc) then failed at collection with ``ImportError: cannot import name 'HTTPError'`` because the stub did not define those names. The existing test_llama_cpp_windows_nvidia_path.py did not trigger the same issue because it sorts after test_a* / test_b* / etc, by which point the real httpx has already been imported and setdefault is a no-op. Switch the stub installation to ``importlib.util.find_spec(name) is None`` so we only fall back to the stub when the real module truly is not installed. Backend CI installs httpx, structlog, and the studio/backend/loggers package is reachable via the sys.path augmentation a few lines above, so on CI all three find_spec calls succeed and no stubs are installed at all. Also add HTTPError and Response to the stub module for the offline case, so anyone running this test outside CI with httpx absent still gets a stub that satisfies the broader test suite's imports. Refs #5106 * test_5106 + llama_cpp: extract win32 PATH helper and harden the regression test Follow-up to PR #5376's review feedback. Three real findings from the bot reviewers, plus one stale one. 1. (codex P2 line 201, gemini medium line 209) The regression test's _build_path_dirs_like_start_llama_server hand-copied the win32 branch of LlamaCppBackend.start_llama_server, so a future drop or reorder of _windows_pip_nvidia_dll_dirs(sys.prefix) in production would have passed the test silently. Extract a new staticmethod LlamaCppBackend._build_windows_path_dirs (binary_dir, prefix, cuda_path). Production start_llama_server now calls this helper. The test's wrapper is reduced to a one-line delegate that forwards to the staticmethod, so the regression asserts against the exact production logic instead of a parallel copy of it. 2. (codex P2 line 245) test_nvidia_smi_probe_reports_synthetic_gpu did not clear CUDA_VISIBLE_DEVICES. On a shared GPU runner with the variable set in the parent shell, _get_gpu_free_memory() filters the mocked CSV and returns [] or falls through to the torch fallback. Cleared CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES via monkeypatch.delenv(..., raising=False). 3. (codex P2 line 66) _maybe_stub gated on importlib.util.find_spec ("loggers"), which returns a spec because studio/backend/loggers/ is on sys.path. But the actual import chain loads loggers/handlers.py which does `from fastapi import Request, Response` at module load. In a lightweight env without fastapi installed, the stub never lands and `from core.inference.llama_cpp import LlamaCppBackend` raises during collection. Switched _maybe_stub to a real import attempt under try / except ImportError so the stub falls into place when the package is discoverable but not importable. CI has fastapi so this is purely a developer- machine ergonomics fix. The fourth comment (codex P1 line 85 "Keep the httpx stub from leaking across tests") was already addressed by 7437e73, which replaced the unconditional sys.modules.setdefault with the find_spec-gated _maybe_stub. No code change needed. Production behaviour is unchanged: _build_windows_path_dirs returns exactly the same ordering start_llama_server used inline ([binary_dir, *pip_dirs, cuda_bin?, cuda_bin_x64?]). Verification (run inside studio/backend): pytest tests/test_5106_windows_gpu_detection_mock.py -v -> 10 passed pytest tests/test_llama_cpp_*.py tests/test_llama_server_args.py tests/test_5106_windows_gpu_detection_mock.py -q -> 171 passed CUDA_VISIBLE_DEVICES=1 pytest tests/test_5106_windows_gpu_detection_mock.py::TestWindowsGpuDetectionAfter5106Fix::test_nvidia_smi_probe_reports_synthetic_gpu -> 1 passed * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename Windows GPU detection test to a generic filename and trim comments - studio/backend/tests/test_5106_windows_gpu_detection_mock.py -> studio/backend/tests/test_windows_gpu_detection_mock.py The file is the generic regression suite for Windows GPU detection; encoding the issue number in the filename is noise. - Shorten module docstring, helper docstrings, per-test docstrings and inline comments in the renamed test file. No behaviour change, all 10 cases still pass. - Shorten the _build_windows_path_dirs docstring in studio/backend/core/inference/llama_cpp.py and update the test-path reference; trim the win32 call-site comment to one line. Local verification: - pytest studio/backend/tests/test_windows_gpu_detection_mock.py -- 10 passed. - pytest studio/backend/tests/test_llama_cpp_windows_nvidia_path.py studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_windows_gpu_detection_mock.py -- 110 passed. * Studio: harden _wait_for_health against transient httpx ReadError The probe loop in LlamaCppBackend._wait_for_health only caught ConnectError and TimeoutException. On Windows, when llama-server.exe accepts the TCP probe and then dies before sending HTTP headers, the peer process RST closes the socket. httpx maps this to ReadError ("WinError 10054 -- An existing connection was forcibly closed by the remote host"), which fell through the except clause and bubbled out of _wait_for_health, the routes/inference.py load_model handler, and back to /api/inference/load as an opaque 500. The crash diagnostic Studio actually wants to surface lives on the self._process.poll() branch at the top of the loop body: "llama-server exited with code X. Output: ...". We never reached that branch on the WinError 10054 path because the very first probe blew up. Expand the except to also swallow ReadError and RemoteProtocolError so the next 0.5-second iteration runs the poll() branch. Outcomes: * Process really died: structured exit-code + last-stdout log line. * Single transient probe blip: silently retried; load succeeds. Adds studio/backend/tests/test_llama_cpp_wait_for_health.py with five cases covering happy-path 200, transient ReadError + dead process, RemoteProtocolError + dead process, ConnectError cycling until success, and dead process before the first probe. The new cases would have failed against the old except clause -- ReadError / RemoteProtocolError would have propagated instead of returning False. Found while triaging the Windows Studio GGUF CI flake on this PR's 5a6ddc3 push: llama-server.exe (b9203 prebuilt) crashed within 2.2 s of launch on the GPU-less runner, and Studio reported "WinError 10054" instead of an upstream-tag-attributable exit-code line. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
Summary
Most direct fix for #5106 ("GPU detected but model loaded entirely on RAM/CPU" on Windows). This is the canonical fix per Studio's install design:
install_python_stackalready bundles torch with matching CUDA wheels (nvidia-cuda-runtime-cu13,nvidia-cublas-cu13, etc.) which shipcudart64_X.dll,cublas64_X.dll, andcublasLt64_X.dllunder<prefix>/Lib/site-packages/nvidia/<pkg>/(bin|Library/bin)/. The Linux runtime env block instart_llama_serveralready pulls the equivalentnvidia/cu*/libpaths intoLD_LIBRARY_PATH, but the Windows block did not. Without this, the prebuiltllama-server.execould not resolvecudart64_X.dllat runtime unless the user had a matching system CUDA toolkit on PATH, which is exactly the workaround Roland keeps recommending in #5106.What changed
LlamaCppBackend._windows_pip_nvidia_dll_dirs(prefix)resolver globs<prefix>/Lib/site-packages/nvidia/<pkg>/binand<prefix>/Lib/site-packages/nvidia/<pkg>/Library/bin. Both layouts are seen in the wild acrosscuda_runtime/cublas/cudnn/nvjitlinkwheels.start_llama_serverextendspath_dirswith the resolver output before falling back toCUDA_PATH/bin. Pip-installed wheels are the canonical source (mirrors the LinuxLD_LIBRARY_PATHordering); system CUDA toolkit remains a valid fallback.cudart-llama-bin-win-cuda-X.Y-x64.zipever gets renamed.How this relates to #5322 and #5323
dh/fix-5106-windows-nvidia-pip-path): aligns Windows with Linux's existing pattern. This is the canonical mechanism per the install design.dh/fix-5106-windows-cudart-pair): belt-and-suspenders -- downloads upstream's cudart bundle next tollama-server.exe. Useful when torch's nvidia wheels are absent (CPU-only torch,unsloth runstandalone, custom torch installs). Optional given this PR.dh/fix-5106-gpu-pin-threshold-and-cpu-fallback-warn): orthogonal runtime fix for the close-fit case + CPU-fallback diagnostic. Still needed even when CUDA loads fine.If you want a single canonical fix, this PR is sufficient for the Windows symptom. #5322 can be closed in favor of this one. #5323 is independent.
Test plan
python -m pytest studio/backend/tests/test_llama_cpp_windows_nvidia_path.py-- 7 new casespython -m pytest studio/backend/tests/test_llama_cpp_*.py studio/backend/tests/test_llama_server_args.py-- 110 passed, no regressionsnvidia-smishows VRAM usage during inference (nowinget install Nvidia.CUDAneeded)New tests in
studio/backend/tests/test_llama_cpp_windows_nvidia_path.py:test_returns_empty_when_no_nvidia_wheelstest_picks_up_bin_layouttest_picks_up_library_bin_layouttest_mixed_layouts_all_resolvedtest_does_not_walk_outside_nvidiatest_skips_non_directoriestest_missing_prefix_does_not_raiseRefs #5106