Skip to content

[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation#3356

Merged
danielhanchen merged 23 commits into
unslothai:mainfrom
rolandtannous:fix/llamacpp-compatibility-gguf-conversion
Oct 14, 2025
Merged

[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation#3356
danielhanchen merged 23 commits into
unslothai:mainfrom
rolandtannous:fix/llamacpp-compatibility-gguf-conversion

Conversation

@rolandtannous

@rolandtannous rolandtannous commented Sep 23, 2025

Copy link
Copy Markdown
Contributor

PROBLEM

Depends on unslothai/unsloth-zoo#302
The existing GGUF conversion system was non-functional, due to upstream changes in llama.cpp and broken llama.cpp integration. Users encountered critical issues when trying to convert fine-tuned models to GGUF format for deployment:

  • llama.cpp installation process frequently failed
  • Users were sometimes limited to only a few basic quantization types
  • Ollama Modelfile creation required users to manually call get_chat_template() as a prerequisite step
  • Several Ollama chat templates were missing required Modelfile directives (FROM, TEMPLATE) and failed when using ollama create

SOLUTION

Two-Stage Conversion Architecture

Two-stage conversion approach that separates high-precision base conversion from multi-target quantization:

  • Stage one: converts models to optimal intermediate precision formats (f32/f16/bf16) using convert-hf-to-gguf.py.
  • Stage two: applies llama-quantize for precise quantization to all requested formats.

Critical fix: Updated first-conversion precision logic as new llama.cpp versions no longer support requantizing from q8_0 format, preventing conversion failures with recent llama.cpp builds.

Full llama.cpp Quantization Support with Multi-Format Processing

Extended quantization method support to all quantization formats available in llama.cpp. Users can also specify multiple quantization formats in single operations:

quantization_method=["q8_0", "q4_k_m", "q5_k_m", "q2_k"]

The system performs the expensive initial conversion once, then generates all quantization variants from the intermediate representation, eliminating redundant processing and significantly reducing storage overhead and conversion time.

Modular llama.cpp Integration with Orchestrated Pipeline

Code now uses clean modular integration. The new save_to_gguf() function serves as the main orchestrator, delegating specialized operations to unsloth_zoo.llama_cpp modules:

  1. Installation verification via check_llama_cpp()
  2. Converter preparation via _download_convert_hf_to_gguf()
  3. Initial conversion via convert_to_gguf()
  4. Multi-quantization via quantize_gguf()

Enhanced Save Functions with Comprehensive Metadata

Redesigned save_pretrained_gguf():

  • Returns comprehensive metadata dictionary containing all conversion results, file locations, and model characteristics
  • VLM Detection: Automatic detection of Vision-Language Models with proper dual-file handling (model.gguf + mmproj.gguf)
  • GPT-OSS Support: Special handling for GPT-OSS architecture models requiring different conversion paths
  • Smart First-Conversion Selection: Automatically chooses optimal intermediate format based on target quantizations and hardware capabilities

Restructured push_to_hub_gguf():

  • Leverages Local Conversion: Calls save_pretrained_gguf() first, then systematically uploads results
  • Proper File Naming: Handles temporary directories and ensures correct model naming for Hub upload
  • Comprehensive Upload: Automatically uploads GGUF files, config.json, README.md, and Ollama Modelfile
  • Enhanced Error Handling: Improved error messages and cleanup procedures for failed upload operations

Automated Ollama Modelfile Creation

Template-to-Model Mapping System:
Introduced systematic model-to-template association via TEMPLATE_TO_MODEL_MAPPER and MODEL_TO_TEMPLATE_MAPPER. This eliminates the need for users to manually call get_chat_template() as a precondition, enabling automatic selection of appropriate chat templates for Ollama Modelfile generation based on model architecture.

Template Fixes and Additions:

  • Fixed missing FROM and TEMPLATE directives in broken Ollama templates for gpt-oss, qwen3, and Gemma3n architectures
  • Added new chat templates (Starling, Yi-chat) with proper Ollama formatting
  • Ensures all generated Modelfiles are immediately compatible with ollama create without manual intervention

Dependency Resolution and Architectural Improvements

Eliminated Circular Imports:
Relocated CHAT_TEMPLATES from chat_templates.py to dedicated template_mappers.py module, to allow calls from both save.py and chat_templates.py while avoiding circular import failure errors.

Testing

Multiple testing rounds during development and after initial branch commit to fork and final commit before PR.
Testing branches: https://github.com/unslothai/rolandtannous/unsloth-zoo@fix/llamacpp-compatibility-gguf-conversion and https://github.com/unslothai/rolandtannous/unsloth@fix/llamacpp-compatibility-gguf-conversion

End to End Testing:

  • Local and colab.
  • Tested both saving locally and pushing to hub
  • Tested and verified proper post-conversion inference usin llama.cpp llama-cli for text models and llama-mtmd-cli for multimodals
  • Tested creation of ollama models using generated Modelfile
  • Tested ollama model inference using ollama run model-name

Models Tested:

gptoss, llama3.1, llama3.2, Pixtral , Gemma3n, Gemma3, Gemma2, Qwen2, Qwen2.5, Qwen3, Mistral and Phi models

Also tested gpt-oss-20 on colab T4 . Link to notebook

Solves

#3348
#3297
#3090
#3229
#3215
#3202
#3194
#3133
#3124
#3040
#2984
#2950
#2860
#2667
#2580
#2526
#2478
#2399
#2370
#2365
#2360
#2326
#2321
#2290
#2209
#2193
#2115
#2058
#2007
#1917
#1905
#1903
#1846
#1781
#1729
#1721
#1645
#1610
#1546
#1504
#965
#835
#748
#785
#2098
#3050

@rolandtannous rolandtannous changed the title [Part1] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation [Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation Sep 23, 2025
Comment thread unsloth/save.py
Comment thread unsloth/save.py Outdated

@danielhanchen danielhanchen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work

@danielhanchen

Copy link
Copy Markdown
Member

@mmathew23 @Datta0 Can you guys also review this - appreciate it :)

@mmathew23 mmathew23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, thanks!

Comment thread unsloth/models/mapper.py
Comment thread unsloth/save.py Outdated
Comment thread unsloth/save.py Outdated
Comment thread unsloth/save.py
@rolandtannous rolandtannous force-pushed the fix/llamacpp-compatibility-gguf-conversion branch from 07ea7f8 to 48adee8 Compare October 13, 2025 20:19
@danielhanchen danielhanchen merged commit 05e91e7 into unslothai:main Oct 14, 2025
abiswas-realadvice pushed a commit to abiswas-realadvice/unsloth that referenced this pull request May 14, 2026
…ltiple Quantizations and Automated Ollama Modelfile Creation (unslothai#3356)

* GGUF conversion code + model to template mappers + chat template adds/fixes

* syntax fixes

* extract tokenizer from video processor

* model file cleanup after multiple quantizations

* flip is_vlm flag is mmproj has text only llama.cpp support for MLM

* preserve processor files for merge operation

* reinstate chr(92)

* fixed starling mapping

* ollama Modelfile from gguf for text models

* specify bf16 ollama model precision for vision models

* fix keyError in templatedict when no mapping

* revert chat_templates.py to original syntax

* ollama modelfile template to model mapper

* link save to ollama mapper, fix some bugs

* rename to ollama_template_mappers

* Remove old template_mappers file (renamed ollama_template_mappers)

* fix final printout

* fix model list and printout

* remove yi base model, keep chat/instruct

* fixed dangling > in HF repo readme for uploaded models

* added granite model ollama support

* Combine use_local_gguf() blocks

* model_name relative to base_model_name
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants