[Part 1] Complete llama.cpp Integration Overhaul with Enhanced Build System and Multi-Modal Support#302
Conversation
d0a9678 to
524ae5f
Compare
|
@mmathew23 @Datta0 Can you guys also review this - appreciate it :) |
|
@rolandtannous some conflicts |
|
mmathew23
left a comment
There was a problem hiding this comment.
Left a couple comments. I'm unclear on whether or not we need/should allow sudo. @danielhanchen Did you want to add that back?
Obviously a fairly significant change, so we should definitely ask others to help test on various envs.
I have tested it extensively on linux machines . I tested against 90% of the models you find in the new ollama template file but I haven't been able to test it on windows machines because I don't have access to one. |
29a5432 to
4b4190e
Compare
- Normalize UNSLOTH_LLAMA_CPP_SCRIPTS_DIR via abspath/expanduser/expandvars and switch the candidate check to os.path.isfile so a relative env var or a directory named convert_hf_to_gguf.py no longer corrupts the lru_cache key or surfaces as IsADirectoryError. - Return (path, mtime_ns, size) from the resolver so the lru_cache invalidates when a pinned local converter is updated in place. - Differentiate the patched output filename by a sha256 prefix of the cache key so distinct sources do not overwrite each other in LLAMA_CPP_DEFAULT_DIR while their cached metadata diverges. - Initialize text_archs and vision_archs alongside supported_types so introspection failures (e.g. text-only converters with no MMPROJ entry) produce a clean warning instead of an UnboundLocalError at frozenset(). - Use logger.info instead of print for the local-script notice so it follows the module's logging routing. - Branch error log and RuntimeError text on the actual source (local file vs network download) for clearer diagnostics. - Fix Github "report" comment typo to "repository". Blame-touched lines preserved in intent: - The error-logging lines whose blame points at 70f2ce7 ("[Part 1] Complete llama.cpp Integration Overhaul ... Multi-Modal Support (unslothai#302)") still log with exc_info=True and re-raise a RuntimeError chained from the original exception; only the source label is now accurate ("local file '...'" vs "network download") because the local-script branch of the same try block can also fail there. - The patched_filename construction whose blame points at b9daae8 ("feat: Add Windows native support and migrate llama.cpp to ~/.unsloth/ (unslothai#526)") still writes into LLAMA_CPP_DEFAULT_DIR using os.path.join with a Python-built filename; the lru_cache(2) introduced alongside the local-override feature would otherwise let two cache entries share one on-disk file, so the filename now carries a 12-char sha256 of the cache key whenever a local script is in play, leaving the network-only filename unchanged.
- check_llama_cpp now consults UNSLOTH_LLAMA_CPP_SCRIPTS_DIR before falling back to llama_cpp_folder, so the install-check matches what _download_convert_hf_to_gguf will actually use; the candidate probe also moves from os.path.exists to os.path.isfile so a directory named convert(-|_)hf_to_gguf.py is no longer accepted as a script. Error text now mentions both lookup locations.
- use_local_gguf gains an optional llama_cpp_dir argument; when omitted it prefers UNSLOTH_LLAMA_CPP_SCRIPTS_DIR's sibling gguf-py before falling back to LLAMA_CPP_DEFAULT_DIR, so a pinned llama.cpp checkout's matching gguf-py is loaded by the converter's `import gguf` instead of failing with ModuleNotFoundError.
- Revert the sha256-suffixed patched filename. _download_convert_hf_to_gguf_cached now writes to the stable LLAMA_CPP_DEFAULT_DIR/{name}.py in both local and network modes, restoring backwards compatibility with convert_to_gguf()'s default converter_location keyword and avoiding orphaned digest files when a pinned converter is updated in place. Severity downgraded P1 -> P2 by finding-verify with disposition REVERT_UNSUBSTANTIATED; the cross-source disk/cache mismatch the digest was guarding against is hypothetical because every in-tree caller (unsloth_repo/unsloth/save.py, studio export.py) passes the returned path explicitly. The hashlib import added for the digest is removed.
- Replace `Failed during {source}/introspection of original script` with `Failed during {source} (introspection of original script)` so a quoted file path inside source no longer reads as if /introspection were a path component.
Blame-touched lines preserved in intent:
- The `use_local_gguf` signature and docstring (lines 150-151) whose blame points at 70f2ce7 ("[Part 1] Complete llama.cpp Integration Overhaul ... Multi-Modal Support (unslothai#302)") still defines the same context manager with the same external contract; only an optional llama_cpp_dir keyword is added so callers passing nothing keep the prior behavior whenever UNSLOTH_LLAMA_CPP_SCRIPTS_DIR is unset.
- The `gguf_py_path = os.path.join(LLAMA_CPP_DEFAULT_DIR, "gguf-py")` assignment (line 155) whose blame points at b9daae8 ("feat: Add Windows native support and migrate llama.cpp to ~/.unsloth/ (unslothai#526)") still computes gguf_py_path via os.path.join under a llama.cpp directory; the only change is that the directory is selected from UNSLOTH_LLAMA_CPP_SCRIPTS_DIR when that env var points at a checkout containing a gguf-py/, falling back to LLAMA_CPP_DEFAULT_DIR otherwise, so the Windows native install layout is preserved.
- The `check_llama_cpp` converter-discovery block (the comment, the for-loop over [convert-hf-to-gguf.py, convert_hf_to_gguf.py], the os.path.join call, the candidate probe, the converter_location assignment, the break, and the trailing pass at lines 565-571) whose blame points at 70f2ce7 ("[Part 1] Complete llama.cpp Integration Overhaul ... Multi-Modal Support (unslothai#302)") is preserved as the fallback branch when UNSLOTH_LLAMA_CPP_SCRIPTS_DIR is unset; the only change there is os.path.exists -> os.path.isfile to match _resolve_local_convert_script and reject directories with the script's name. The wrapping `if local_convert_script is not None: ... else:` only adds a new short-circuit; the unchanged for-loop is the else branch.
- The "Failed to find converter script in {llama_cpp_folder}" RuntimeError (line 574) whose blame points at 70f2ce7 ("[Part 1] Complete llama.cpp Integration Overhaul ... Multi-Modal Support (unslothai#302)") still raises the same RuntimeError class under the same condition (converter_location is None); only the message string is extended to also mention UNSLOTH_LLAMA_CPP_SCRIPTS_DIR so a misconfigured user sees both lookup locations.
- convert_to_gguf builds an env that prepends UNSLOTH_LLAMA_CPP_SCRIPTS_DIR/gguf-py to PYTHONPATH (when that directory exists) and passes it via env= to both subprocess.run call sites. Without this, the parent process's sys.path mods made by use_local_gguf do not propagate into the converter's child interpreter, so a user that pinned a newer convert_hf_to_gguf.py via the env var ended up running it against the default-install or system-installed gguf and saw ModuleNotFoundError or version-mismatch failures. - use_local_gguf now also emits a logger.warning when UNSLOTH_LLAMA_CPP_SCRIPTS_DIR is set but the directory has no gguf-py/ subdirectory, mirroring the warning pattern in _resolve_local_convert_script. This surfaces the cross-version mismatch case that previously silently fell back to LLAMA_CPP_DEFAULT_DIR. - _resolve_local_convert_script's missing-converter warning now mentions both accepted filenames (convert_hf_to_gguf.py and convert-hf-to-gguf.py) so users debugging the env var see the full set the resolver actually probes. - Bump @lru_cache(2) on _download_convert_hf_to_gguf_cached to maxsize=8 so cross-mode toggling between local and network sources, plus a few in-place updates of a pinned converter, no longer evict cache slots and trigger redundant downloads or re-reads. Negligible memory overhead because the cached payload is small. - Reorder check_llama_cpp's fallback list to ["convert_hf_to_gguf.py", "convert-hf-to-gguf.py"] so it matches _resolve_local_convert_script's underscore-first preference; modern llama.cpp ships only the underscore variant, so this also avoids an unnecessary stat on the legacy hyphenated name in the common case. Blame-touched lines preserved in intent: - The print-output `subprocess.run(command, shell=False, check=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)` continuation line (line 1443, which carries the stdout=subprocess.PIPE, stderr=subprocess.STDOUT keyword arguments) whose blame points at 70f2ce7 ("[Part 1] Complete llama.cpp Integration Overhaul ... Multi-Modal Support (unslothai#302)") still passes the same shell=False/check=True/text=True/stdout=subprocess.PIPE/stderr=subprocess.STDOUT arguments in the same order; only `env=sub_env` is added at the end so the venv-friendly invocation also picks up the pinned gguf-py via PYTHONPATH instead of the default-install gguf. - The silent-output `subprocess.run(command, shell=False, check=True, capture_output=True)` line (line 1446) whose blame points at 6b68187 ("fix: use `sys.executable` for pip and python subprocesses to support venv environments (unslothai#503)") still passes the same command/shell=False/check=True/capture_output=True; only `env=sub_env` is added so the same sys.executable launch the original fix established also receives PYTHONPATH for the pinned gguf-py. - The `try`/`except subprocess.CalledProcessError` block surrounding both calls (whose blame points at 70f2ce7 "[Part 1] Complete llama.cpp Integration Overhaul ... Multi-Modal Support (unslothai#302)" and 71229d8 "Fix Mistral, Qwen (unslothai#36)") is unchanged: same RuntimeError type, same command-string formatting, same stdout-print-on-failure path.
Discovery: extract LLAMA_CPP_CONVERTER_FILENAMES module constant and use it from both _resolve_local_convert_script and check_llama_cpp so both paths agree on filename precedence. Switch check_llama_cpp from os.path.exists to os.path.isfile so a directory accidentally named convert_hf_to_gguf.py is no longer returned as a runnable converter. Network timeout: split requests.get timeout into a (connect, read) tuple of (5, 30) so connection-phase failures resolve in 5 seconds instead of waiting the full 30s read window. Diagnostics: replace the 'download or introspection' phrasing in the shared error path with 'loading or introspecting' so the message accurately covers both the local-file read branch and the network branch. The four touched lines (converter discovery loop, exists-vs-isfile check, and the two error strings) trace back via git blame to commit 70f2ce7 "[Part 1] Complete llama.cpp Integration Overhaul with Enhanced Build System and Multi-Modal Support (unslothai#302)". The original behavior is preserved: the same converter filenames are discovered (just via a shared constant), os.path.isfile is a strict subset of os.path.exists that only rejects directories misnamed like the converter (which cannot be executed anyway), and the wording change is purely diagnostic and now correctly covers the local-read branch added in this change set.
…621) * Allow local convert_hf_to_gguf.py via UNSLOTH_LLAMA_CPP_SCRIPTS_DIR Add _resolve_local_convert_script helper and an opt-in path that reads convert_hf_to_gguf.py from UNSLOTH_LLAMA_CPP_SCRIPTS_DIR instead of fetching master from GitHub. Lets callers (e.g. Studio) pin the convert script to the same llama.cpp ref as their llama-quantize binary and gguf-py, avoiding API drift. The original _download_convert_hf_to_gguf is now a thin wrapper that resolves the env var on every call and forwards the resolved path to the @lru_cache'd implementation as part of its cache key. Resolving inside the cached function would have left subsequent calls pinned to the first resolution, since lru_cache keys on the args passed (always the default) rather than values mutated inside the body. Default behavior (env var unset) is unchanged: download from master. * Allow local convert_hf_to_gguf.py via UNSLOTH_LLAMA_CPP_SCRIPTS_DIR Add _resolve_local_convert_script() helper and an opt-in path that reads convert_hf_to_gguf.py from UNSLOTH_LLAMA_CPP_SCRIPTS_DIR instead of fetching master from GitHub. Lets callers (e.g. Studio) pin the convert script to the same llama.cpp ref as their llama-quantize binary and gguf-py, avoiding API drift. The original _download_convert_hf_to_gguf is now a thin wrapper that resolves the env var on every call and forwards the resolved path to the @lru_cache'd implementation as part of its cache key. Resolving inside the cached function would have left subsequent calls pinned to the first resolution, since lru_cache keys on the args passed (always the default) rather than values mutated inside the body. Default behavior (env var unset) is unchanged: download from master. * Add unsloth_compiled_cache to gitignore * Harden local convert_hf_to_gguf.py override path _resolve_local_convert_script: normalize UNSLOTH_LLAMA_CPP_SCRIPTS_DIR via os.path.expanduser + os.path.abspath so the cache key on the cached implementation is stable across cwd changes; switch os.path.exists -> os.path.isfile so a directory or socket bearing the converter name no longer wins the lookup; return (abs_path, mtime_ns, size) so in-place edits to the same path invalidate the lru_cache entry. _download_convert_hf_to_gguf_cached: unpack the new tuple, replace the bare print with logger.info to match the rest of the function, and initialize text_archs and vision_archs alongside supported_types so a converter that lacks ModelType.MMPROJ (or where ModelBase / ModelType is missing entirely) no longer raises UnboundLocalError on the unconditional frozenset(...) calls. Re-attach cache_clear / cache_info from the cached implementation onto the public _download_convert_hf_to_gguf wrapper so the pre-existing lru_cache surface (in __all__) keeps working for external callers. * Polish local convert_hf_to_gguf.py override hardening _resolve_local_convert_script: wrap the per-candidate isfile/stat block in try/except OSError so a vanishing-during-update converter falls through to the next name (and ultimately the network fallback) instead of escaping a bare FileNotFoundError out of the resolver. Final fall-through warning now lists both convert_hf_to_gguf.py and convert-hf-to-gguf.py since the loop tries both names. Fix typo on the cached function comment: 'Github report' -> 'GitHub repository'. @lru_cache(2) -> @lru_cache(8): two slots is too small once the cache key includes (path, mtime_ns, size). Any second distinct local identity (post-update mtime change, distinct SCRIPTS_DIR path) would evict the (name, None) network slot and force a refetch on a later unset-env call. Eight slots cover realistic local-identity churn at trivial memory cost. Re-attach cache_parameters and __wrapped__ on the public _download_convert_hf_to_gguf wrapper so the full pre-existing lru_cache surface (cache_clear, cache_info, cache_parameters, __wrapped__) is preserved for external callers and inspect.unwrap. * Tighten convert_hf_to_gguf comments * Scrub leaked references from comments and string literals * Apply codex correction to scrub commit Reverts the cosmetic whitespace edits introduced by 8b4fd40 ("Scrub leaked references from comments and string literals") that did not correspond to any leak token and that damaged whitespace-significant content: - Restored canonical "PURPOSE. See" / "program. If not" double space in the two GPL/LGPL license headers. - Restored 4-space docstring continuations in _resolve_local_convert_script, _find_visual_studio, _find_openssl_root, _find_lib_path. - Restored the 12-space indent inside the bytes-literal carrying source code substituted via re.sub into convert_hf_to_gguf.py for the Qwen3MoE num_experts patch. The previous 1-space prefix would have produced invalid Python (IndentationError) at runtime in the patched file. The post-correction tree of unsloth_zoo/llama_cpp.py matches the parent commit 2ac599 (the last good state before the bad scrub) byte for byte. * Harden local convert_hf_to_gguf.py override Cache implementation: drop lru_cache from 8 entries to 1. The cached function returns a path to a single shared on-disk file at LLAMA_CPP_DEFAULT_DIR/{name}.py, which is overwritten by every cache miss. Holding more than one entry in memory therefore returns stale architecture metadata for any non-active key while the disk content reflects only the last miss. Local override resolution: raise RuntimeError when UNSLOTH_LLAMA_CPP_SCRIPTS_DIR is set but invalid (non-directory, missing converter, or stat failure). The previous warn-and-return-None silently fell back to the network download, defeating an explicit pin. Network timeout: add timeout=30 to requests.get on the converter URL, matching the existing timeout convention used elsewhere in the module. Compatibility shim: stop forwarding __wrapped__ from the cached inner function. The inner signature is (name, _local_script_info) while the public wrapper accepts only name, so inspect.unwrap(fn)() previously raised TypeError. cache_clear, cache_info, and cache_parameters are still forwarded for parity with the pre-split surface. * Unify llama.cpp converter discovery and tighten timeout/wording Discovery: extract LLAMA_CPP_CONVERTER_FILENAMES module constant and use it from both _resolve_local_convert_script and check_llama_cpp so both paths agree on filename precedence. Switch check_llama_cpp from os.path.exists to os.path.isfile so a directory accidentally named convert_hf_to_gguf.py is no longer returned as a runnable converter. Network timeout: split requests.get timeout into a (connect, read) tuple of (5, 30) so connection-phase failures resolve in 5 seconds instead of waiting the full 30s read window. Diagnostics: replace the 'download or introspection' phrasing in the shared error path with 'loading or introspecting' so the message accurately covers both the local-file read branch and the network branch. The four touched lines (converter discovery loop, exists-vs-isfile check, and the two error strings) trace back via git blame to commit 70f2ce7 "[Part 1] Complete llama.cpp Integration Overhaul with Enhanced Build System and Multi-Modal Support (#302)". The original behavior is preserved: the same converter filenames are discovered (just via a shared constant), os.path.isfile is a strict subset of os.path.exists that only rejects directories misnamed like the converter (which cannot be executed anyway), and the wording change is purely diagnostic and now correctly covers the local-read branch added in this change set. --------- Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: danielhanchen <info@unsloth.ai>
PROBLEM
Requires unslothai/unsloth#3356
llama.cpp integration in unsloth-zoo suffered from critical failures:
convert_to_gguf()function attempted to handle base conversion and quantization simultaneously, making it impossible to generate multiple quantization formats efficiently or adapt to llama.cpp's new requirement that quantization must occur from high-precision formats onlymistral_commonas dependency package broke llama.cppSOLUTION
Separated Quantization Pipeline
New two-function architecture:
convert_to_gguf(): Focuses solely on high-precision base conversion (f32/f16/bf16), handling model loading, architecture detection, and initial GGUF generationquantize_gguf(): Dedicated quantization function using llama-quantize binary, optimized for parallel processing with thread allocation based on CPU countThis separation enables the two-stage architecture where one expensive base conversion supports multiple quantization targets.
Dynamic Architecture Detection via Class Introspection
Previous regex-based architecture extraction broke when llama.cpp restructured its converter. New implementation uses proper class introspection:
ModelBase._model_classes[ModelType.TEXT]andModelBase._model_classes[ModelType.MMPROJ]Intelligent Permission and Dependency Management
Replaced hardcoded sudo operations with platform-aware permission handling. The new
check_build_requirements()function validates essential tools (gcc, cmake, curl, git) and libraries (libcurl4-openssl-dev) using multiple detection methods (which, pkg-config, curl-config).Cloud environments (Colab/Kaggle) receive automatic sudo elevation with notification, while local environments require explicit user consent. Missing packages are identified upfront with specific installation guidance. Build commands are run as regular user as they do not require sudo privileges.
GGUF Isolation with Context Management
Introduced
use_local_gguf()context manager to resolve conflicts between system GGUF and llama.cpp's extended GGUF implementation:This ensures llama.cpp's custom GGUF extensions (Mistral-specific classes, additional model support) are used during conversion, preventing import errors and missing functionality.
Multi-Modal Conversion Architecture
Enhanced
convert_to_gguf()now handles Vision-Language Models through dual conversion paths:Models with text-only llama.cpp support despite having vision capabilities now convert successfully instead of failing.
Enhanced Build System with Fallback Strategy
Use dual-strategy build system:
MISC
tokenizer.modelduring merging process insaving_utilsif applicable.TESTING
testing branches: Testing branches: https://github.com/unslothai/rolandtannous/unsloth-zoo@fix/llamacpp-compatibility-gguf-conversion and https://github.com/unslothai/rolandtannous/unsloth@fix/llamacpp-compatibility-gguf-conversion
Build System Validation
Architecture Detection
Multi-Modal Conversion
Environment Isolation
Models Tested
Comprehensive testing across model architectures: GPT-OSS, Llama 3.1/3.2, Pixtral (VLM), Gemma variants, Qwen series, Mistral, Phi models. Specific validation of GPT-OSS-20B on Colab T4. Notebook
End-to-End Validation
Solves
unslothai/unsloth#3348
unslothai/unsloth#3297
unslothai/unsloth#3090
unslothai/unsloth#3229
unslothai/unsloth#3215
unslothai/unsloth#3202
unslothai/unsloth#3194
unslothai/unsloth#3133
unslothai/unsloth#3124
unslothai/unsloth#3040
unslothai/unsloth#2984
unslothai/unsloth#2950
unslothai/unsloth#2860
unslothai/unsloth#2667
unslothai/unsloth#2580
unslothai/unsloth#2526
unslothai/unsloth#2478
unslothai/unsloth#2399
unslothai/unsloth#2370
unslothai/unsloth#2365
unslothai/unsloth#2360
unslothai/unsloth#2326
unslothai/unsloth#2321
unslothai/unsloth#2290
unslothai/unsloth#2209
unslothai/unsloth#2193
unslothai/unsloth#2115
unslothai/unsloth#2058
unslothai/unsloth#2007
unslothai/unsloth#1917
unslothai/unsloth#1905
unslothai/unsloth#1903
unslothai/unsloth#1846
unslothai/unsloth#1781
unslothai/unsloth#1729
unslothai/unsloth#1721
unslothai/unsloth#1645
unslothai/unsloth#1610
unslothai/unsloth#1546
unslothai/unsloth#1504
unslothai/unsloth#965
unslothai/unsloth#835
unslothai/unsloth#748
unslothai/unsloth#785
unslothai/unsloth#2098
unslothai/unsloth#3050