[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation #3356
Merged
danielhanchen merged 23 commits on Oct 14, 2025
@mmathew23 @Datta0 Can you guys also review this - appreciate it :)
mmathew23 reviewed on Oct 8, 2025
abiswas-realadvice pushed a commit to abiswas-realadvice/unsloth that referenced this pull request on May 14, 2026:
…ltiple Quantizations and Automated Ollama Modelfile Creation (unslothai#3356)
* GGUF conversion code + model to template mappers + chat template adds/fixes
* syntax fixes
* extract tokenizer from video processor
* model file cleanup after multiple quantizations
* flip is_vlm flag is mmproj has text only llama.cpp support for MLM
* preserve processor files for merge operation
* reinstate chr(92)
* fixed starling mapping
* ollama Modelfile from gguf for text models
* specify bf16 ollama model precision for vision models
* fix keyError in templatedict when no mapping
* revert chat_templates.py to original syntax
* ollama modelfile template to model mapper
* link save to ollama mapper, fix some bugs
* rename to ollama_template_mappers
* Remove old template_mappers file (renamed ollama_template_mappers)
* fix final printout
* fix model list and printout
* remove yi base model, keep chat/instruct
* fixed dangling > in HF repo readme for uploaded models
* added granite model ollama support
* Combine use_local_gguf() blocks
* model_name relative to base_model_name
PROBLEM
Depends on unslothai/unsloth-zoo#302
The existing GGUF conversion system was non-functional, due to upstream changes in llama.cpp and broken llama.cpp integration. Users encountered critical issues when trying to convert fine-tuned models to GGUF format for deployment:
* [...] get_chat_template() as a prerequisite step
* [...] (FROM, TEMPLATE) and failed when using ollama create
SOLUTION
Two-Stage Conversion Architecture
The conversion now uses a two-stage approach that separates the high-precision base conversion from multi-target quantization.
Critical fix: the first-conversion precision logic was updated because newer llama.cpp versions no longer support requantizing from the q8_0 format. This prevents conversion failures with recent llama.cpp builds.
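A minimal sketch of the first-conversion precision choice described above. The helper name `choose_first_conversion`, the set of high-precision type names, and the bf16 default are all assumptions made for illustration; this is not the actual logic in `save.py` or `unsloth_zoo.llama_cpp`. The point it shows is that the stage-1 file is always written at a high-precision type, so q8_0 is never used as the intermediate that later quantizations are derived from.

```python
# Hypothetical helper: names and defaults are illustrative, not Unsloth's API.
HIGH_PRECISION = ("f32", "bf16", "f16")  # assumed set of intermediate types


def choose_first_conversion(quantization_methods, default="bf16"):
    """Pick the precision for the stage-1 (HF checkpoint -> GGUF) conversion."""
    methods = [m.lower() for m in quantization_methods]
    if not methods:
        raise ValueError("at least one quantization method is required")
    # A single high-precision target can be produced directly by stage 1,
    # so no separate quantization pass is needed for it.
    if len(methods) == 1 and methods[0] in HIGH_PRECISION:
        return methods[0]
    # Everything else is quantized from a high-precision intermediate.
    # q8_0 is deliberately never returned here, even if it was requested.
    return default


print(choose_first_conversion(["q4_k_m", "q8_0"]))
print(choose_first_conversion(["f16"]))
```

Under these assumptions, a request for q8_0 alone still goes through a bf16 intermediate rather than converting straight to q8_0 and requantizing from it later.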
Full llama.cpp Quantization Support with Multi-Format Processing
Extended quantization method support to all quantization formats available in llama.cpp. Users can also specify multiple quantization formats in a single operation.
The system performs the expensive initial conversion once, then generates all quantization variants from the intermediate representation, which eliminates redundant processing and significantly reduces storage overhead and conversion time.
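The "convert once, quantize many" behaviour described above can be sketched as a small planning function. Everything here is hypothetical: `plan_gguf_jobs`, the step tuples, and the output file naming are illustrative assumptions, not the PR's implementation. The sketch only demonstrates that N requested formats cost one base conversion plus at most N quantization passes, with duplicates and the base format itself skipped.

```python
# Hypothetical planner: function name and file naming are assumptions.
def plan_gguf_jobs(model_name, quantization_methods, base="bf16"):
    """Return ordered steps: one base conversion, then one quantize per new format."""
    base = base.lower()
    steps = [("convert", f"{model_name}.{base.upper()}.gguf")]
    seen = {base}  # the base file already satisfies a request for that format
    for method in quantization_methods:
        m = method.lower()
        if m in seen:
            continue  # skip duplicates and the base format
        seen.add(m)
        steps.append(("quantize", f"{model_name}.{m.upper()}.gguf"))
    return steps


for kind, path in plan_gguf_jobs("model", ["q4_k_m", "q8_0", "q5_k_m"]):
    print(kind, path)
```

However many formats are requested, the plan contains exactly one "convert" step; the expensive conversion from the original checkpoint is never repeated per format.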
Modular llama.cpp Integration with Orchestrated Pipeline
The code now uses a clean modular integration. The new save_to_gguf() function serves as the main orchestrator, delegating specialized operations to unsloth_zoo.llama_cpp modules:
* check_llama_cpp()
* _download_convert_hf_to_gguf()
* convert_to_gguf()
* quantize_gguf()
Enhanced Save Functions with Comprehensive Metadata
Redesigned save_pretrained_gguf():
* [...]
Restructured push_to_hub_gguf():
* [...] save_pretrained_gguf() first, then systematically uploads results
Automated Ollama Modelfile Creation
Template-to-Model Mapping System:
Introduced systematic model-to-template association via TEMPLATE_TO_MODEL_MAPPER and MODEL_TO_TEMPLATE_MAPPER. This eliminates the need for users to manually call get_chat_template() as a precondition, enabling automatic selection of appropriate chat templates for Ollama Modelfile generation based on model architecture.
Template Fixes and Additions:
* [...] FROM and TEMPLATE directives in broken Ollama templates for gpt-oss, qwen3, and Gemma3n architectures
* [...] ollama create without manual intervention
Dependency Resolution and Architectural Improvements
Eliminated Circular Imports:
Relocated CHAT_TEMPLATES from chat_templates.py to a dedicated template_mappers.py module, so that both save.py and chat_templates.py can use it without circular import errors.
Testing
Testing was run in multiple rounds: during development, after the initial branch commit to the fork, and after the final commit before opening the PR.
Testing branches: https://github.com/unslothai/rolandtannous/unsloth-zoo@fix/llamacpp-compatibility-gguf-conversion and https://github.com/unslothai/rolandtannous/unsloth@fix/llamacpp-compatibility-gguf-conversion
End to End Testing:
* [...] llama-cli for text models and llama-mtmd-cli for multimodals
* [...] ollama run model-name
Models Tested:
gpt-oss, llama3.1, llama3.2, Pixtral, Gemma3n, Gemma3, Gemma2, Qwen2, Qwen2.5, Qwen3, Mistral and Phi models
Also tested gpt-oss-20 on Colab T4. Link to notebook
Solves
#3348
#3297
#3090
#3229
#3215
#3202
#3194
#3133
#3124
#3040
#2984
#2950
#2860
#2667
#2580
#2526
#2478
#2399
#2370
#2365
#2360
#2326
#2321
#2290
#2209
#2193
#2115
#2058
#2007
#1917
#1905
#1903
#1846
#1781
#1729
#1721
#1645
#1610
#1546
#1504
#965
#835
#748
#785
#2098
#3050