[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation by rolandtannous · Pull Request #3356 · unslothai/unsloth

rolandtannous · 2025-09-23T00:40:16Z

PROBLEM

Depends on unslothai/unsloth-zoo#302
The existing GGUF conversion system was non-functional, due to upstream changes in llama.cpp and broken llama.cpp integration. Users encountered critical issues when trying to convert fine-tuned models to GGUF format for deployment:

llama.cpp installation process frequently failed
Users were sometimes limited to only a few basic quantization types
Ollama Modelfile creation required users to manually call get_chat_template() as a prerequisite step
Several Ollama chat templates were missing required Modelfile directives (FROM, TEMPLATE) and failed when using ollama create

SOLUTION

Two-Stage Conversion Architecture

Two-stage conversion approach that separates high-precision base conversion from multi-target quantization:

Stage one: converts models to optimal intermediate precision formats (f32/f16/bf16) using convert-hf-to-gguf.py.
Stage two: applies llama-quantize for precise quantization to all requested formats.

Critical fix: Updated first-conversion precision logic as new llama.cpp versions no longer support requantizing from q8_0 format, preventing conversion failures with recent llama.cpp builds.

Full llama.cpp Quantization Support with Multi-Format Processing

Extended quantization method support to all quantization formats available in llama.cpp. Users can also specify multiple quantization formats in single operations:

quantization_method=["q8_0", "q4_k_m", "q5_k_m", "q2_k"]

The system performs the expensive initial conversion once, then generates all quantization variants from the intermediate representation, eliminating redundant processing and significantly reducing storage overhead and conversion time.

Modular llama.cpp Integration with Orchestrated Pipeline

Code now uses clean modular integration. The new save_to_gguf() function serves as the main orchestrator, delegating specialized operations to unsloth_zoo.llama_cpp modules:

Installation verification via check_llama_cpp()
Converter preparation via _download_convert_hf_to_gguf()
Initial conversion via convert_to_gguf()
Multi-quantization via quantize_gguf()

Enhanced Save Functions with Comprehensive Metadata

Redesigned save_pretrained_gguf():

Returns comprehensive metadata dictionary containing all conversion results, file locations, and model characteristics
VLM Detection: Automatic detection of Vision-Language Models with proper dual-file handling (model.gguf + mmproj.gguf)
GPT-OSS Support: Special handling for GPT-OSS architecture models requiring different conversion paths
Smart First-Conversion Selection: Automatically chooses optimal intermediate format based on target quantizations and hardware capabilities

Restructured push_to_hub_gguf():

Leverages Local Conversion: Calls save_pretrained_gguf() first, then systematically uploads results
Proper File Naming: Handles temporary directories and ensures correct model naming for Hub upload
Comprehensive Upload: Automatically uploads GGUF files, config.json, README.md, and Ollama Modelfile
Enhanced Error Handling: Improved error messages and cleanup procedures for failed upload operations

Automated Ollama Modelfile Creation

Template-to-Model Mapping System:
Introduced systematic model-to-template association via TEMPLATE_TO_MODEL_MAPPER and MODEL_TO_TEMPLATE_MAPPER. This eliminates the need for users to manually call get_chat_template() as a precondition, enabling automatic selection of appropriate chat templates for Ollama Modelfile generation based on model architecture.

Template Fixes and Additions:

Fixed missing FROM and TEMPLATE directives in broken Ollama templates for gpt-oss, qwen3, and Gemma3n architectures
Added new chat templates (Starling, Yi-chat) with proper Ollama formatting
Ensures all generated Modelfiles are immediately compatible with ollama create without manual intervention

Dependency Resolution and Architectural Improvements

Eliminated Circular Imports:
Relocated CHAT_TEMPLATES from chat_templates.py to dedicated template_mappers.py module, to allow calls from both save.py and chat_templates.py while avoiding circular import failure errors.

Testing

Multiple testing rounds during development and after initial branch commit to fork and final commit before PR.
Testing branches: https://github.com/unslothai/rolandtannous/unsloth-zoo@fix/llamacpp-compatibility-gguf-conversion and https://github.com/unslothai/rolandtannous/unsloth@fix/llamacpp-compatibility-gguf-conversion

End to End Testing:

Local and colab.
Tested both saving locally and pushing to hub
Tested and verified proper post-conversion inference usin llama.cpp llama-cli for text models and llama-mtmd-cli for multimodals
Tested creation of ollama models using generated Modelfile
Tested ollama model inference using ollama run model-name

Models Tested:

gptoss, llama3.1, llama3.2, Pixtral , Gemma3n, Gemma3, Gemma2, Qwen2, Qwen2.5, Qwen3, Mistral and Phi models

Also tested gpt-oss-20 on colab T4 . Link to notebook

Solves

#3348
#3297
#3090
#3229
#3215
#3202
#3194
#3133
#3124
#3040
#2984
#2950
#2860
#2667
#2580
#2526
#2478
#2399
#2370
#2365
#2360
#2326
#2321
#2290
#2209
#2193
#2115
#2058
#2007
#1917
#1905
#1903
#1846
#1781
#1729
#1721
#1645
#1610
#1546
#1504
#965
#835
#748
#785
#2098
#3050

danielhanchen

Nice work

danielhanchen · 2025-10-06T11:30:52Z

@mmathew23 @Datta0 Can you guys also review this - appreciate it :)

mmathew23

Few comments, thanks!

…/fixes

…ltiple Quantizations and Automated Ollama Modelfile Creation (unslothai#3356) * GGUF conversion code + model to template mappers + chat template adds/fixes * syntax fixes * extract tokenizer from video processor * model file cleanup after multiple quantizations * flip is_vlm flag is mmproj has text only llama.cpp support for MLM * preserve processor files for merge operation * reinstate chr(92) * fixed starling mapping * ollama Modelfile from gguf for text models * specify bf16 ollama model precision for vision models * fix keyError in templatedict when no mapping * revert chat_templates.py to original syntax * ollama modelfile template to model mapper * link save to ollama mapper, fix some bugs * rename to ollama_template_mappers * Remove old template_mappers file (renamed ollama_template_mappers) * fix final printout * fix model list and printout * remove yi base model, keep chat/instruct * fixed dangling > in HF repo readme for uploaded models * added granite model ollama support * Combine use_local_gguf() blocks * model_name relative to base_model_name

rolandtannous mentioned this pull request Sep 23, 2025

[Part 1] Complete llama.cpp Integration Overhaul with Enhanced Build System and Multi-Modal Support unslothai/unsloth-zoo#302

Merged

rolandtannous requested a review from danielhanchen September 23, 2025 01:05

danielhanchen reviewed Sep 24, 2025

View reviewed changes

Comment thread unsloth/save.py

danielhanchen reviewed Sep 24, 2025

View reviewed changes

Comment thread unsloth/save.py Outdated

danielhanchen requested changes Sep 24, 2025

View reviewed changes

rolandtannous force-pushed the fix/llamacpp-compatibility-gguf-conversion branch from a912ef6 to 922b41b Compare October 4, 2025 21:07

rolandtannous mentioned this pull request Oct 4, 2025

Prevent duplicate try blocks from being added by introducing a unique marker to identify already patched files #3395 #3409

Closed

mmathew23 reviewed Oct 8, 2025

View reviewed changes

Comment thread unsloth/models/mapper.py

Comment thread unsloth/save.py Outdated

Comment thread unsloth/save.py Outdated

Comment thread unsloth/save.py

rolandtannous added 20 commits October 13, 2025 20:18

GGUF conversion code + model to template mappers + chat template adds…

3f85625

…/fixes

syntax fixes

0a578d2

extract tokenizer from video processor

36857e3

model file cleanup after multiple quantizations

34a9097

flip is_vlm flag is mmproj has text only llama.cpp support for MLM

4987e2f

preserve processor files for merge operation

2a03248

reinstate chr(92)

d18280f

fixed starling mapping

79efd34

ollama Modelfile from gguf for text models

09fecd5

specify bf16 ollama model precision for vision models

455e8ee

fix keyError in templatedict when no mapping

fceb971

revert chat_templates.py to original syntax

dfa91c2

ollama modelfile template to model mapper

ac4ffc0

link save to ollama mapper, fix some bugs

9786979

rename to ollama_template_mappers

30a7024

Remove old template_mappers file (renamed ollama_template_mappers)

2120a38

fix final printout

d2a0d7d

fix model list and printout

48dd3f7

remove yi base model, keep chat/instruct

cd4126c

fixed dangling > in HF repo readme for uploaded models

0b8e9f7

rolandtannous added 3 commits October 13, 2025 20:18

added granite model ollama support

f0b8aeb

Combine use_local_gguf() blocks

3689446

model_name relative to base_model_name

48adee8

rolandtannous force-pushed the fix/llamacpp-compatibility-gguf-conversion branch from 07ea7f8 to 48adee8 Compare October 13, 2025 20:19

danielhanchen merged commit 05e91e7 into unslothai:main Oct 14, 2025

rolandtannous mentioned this pull request Nov 6, 2025

[Bug] Repeated insertion of try/catch into convert_hf_to_gguf.py results in syntax errors #3395

Closed

mmangkad mentioned this pull request Jan 15, 2026

Refactor Ollama template wiring and harden packing helpers #3890

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation#3356

[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation#3356
danielhanchen merged 23 commits into
unslothai:mainfrom
rolandtannous:fix/llamacpp-compatibility-gguf-conversion

rolandtannous commented Sep 23, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

danielhanchen left a comment

Uh oh!

danielhanchen commented Oct 6, 2025

Uh oh!

mmathew23 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

rolandtannous commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PROBLEM

SOLUTION

Two-Stage Conversion Architecture

Full llama.cpp Quantization Support with Multi-Format Processing

Modular llama.cpp Integration with Orchestrated Pipeline

Enhanced Save Functions with Comprehensive Metadata

Automated Ollama Modelfile Creation

Dependency Resolution and Architectural Improvements

Testing

End to End Testing:

Models Tested:

Solves

Uh oh!

Uh oh!

Uh oh!

danielhanchen left a comment

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Oct 6, 2025

Uh oh!

mmathew23 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rolandtannous commented Sep 23, 2025 •

edited

Loading