advanced-gguf-quantizer is a llama.cpp-derived, CUDA-accelerated GGUF quantization toolkit for
creating, inspecting, evaluating, improving, and testing GGUF models. The initial
focus was Blackwell NVFP4 and MXFP6, but the quantizer has since expanded
past that. A BF16 GGUF, an imatrix and its dataset, and a KLD file are used together
to search for and tune the best combination of scales and mixed tensor types.
This tool is in the earliest stages of development and is changing rapidly. More features and improvements are on the way.
A new strategy called Refined Scale Fit (RSF) is applied to Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, and NVFP4: it adjusts the scales and re-fits them using imatrix-supported data.
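As a rough illustration of what an imatrix-weighted scale re-fit can look like, here is a generic weighted least-squares sketch. It is not the RSF implementation, which this README does not specify. It assumes a symmetric integer grid, a per-weight importance vector `w` taken from an imatrix, and hypothetical function names. For fixed integer codes `q`, the scale that minimizes `sum(w * (x - s*q)^2)` is `s = sum(w*x*q) / sum(w*q*q)`.

```python
import numpy as np

def quantize_block(x, scale, qmax=7):
    """Round x / scale to the nearest integer code in [-qmax, qmax]."""
    return np.clip(np.rint(x / scale), -qmax, qmax)

def refit_scale(x, q, w):
    """Closed-form weighted least-squares scale for fixed codes q."""
    denom = float(np.sum(w * q * q))
    return float(np.sum(w * x * q)) / denom if denom > 0.0 else 0.0

def weighted_error(x, q, scale, w):
    """Importance-weighted squared reconstruction error."""
    return float(np.sum(w * (x - scale * q) ** 2))

def fit_block(x, w, qmax=7, iters=4):
    """Alternate re-rounding and scale re-fitting; keep the best result seen."""
    amax = float(np.max(np.abs(x)))
    if amax == 0.0:
        return 0.0, 0.0, np.zeros_like(x)
    scale = amax / qmax  # naive max-abs starting scale
    q = quantize_block(x, scale, qmax)
    best = (weighted_error(x, q, scale, w), scale, q)
    for _ in range(iters):
        scale = refit_scale(x, q, w)
        if scale <= 0.0:
            break
        q = quantize_block(x, scale, qmax)
        err = weighted_error(x, q, scale, w)
        if err < best[0]:
            best = (err, scale, q)
    return best  # (error, scale, codes)
```

Because the starting max-abs result is kept as the initial best, the returned error is never worse than the naive fit. A real K-quant or NVFP4 block has additional structure (sub-block scales, minimums, non-uniform grids) that this sketch ignores.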
Logit KLD files are used during the process to tune and adjust through a series of candidate policies and quantization types, determining which has the lowest error for each tensor and how the imatrix affects the PPL/KLD data.
The tool also uses a weighted score to decide on its own when to promote error-prone or sensitive tensors to higher-bit GGUF types such as Q4_K, Q5_K, Q6_K, MXFP6 (not by default), or Q8_0, when the measured error profile justifies the bpw/size and speed tradeoff.
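The promotion idea can be sketched as a greedy budgeted search. This is an illustration only: the tool's actual weighted score is not documented here, so the sketch substitutes a simple "error reduction per extra bit" score and does not model speed. The function name, the input layout, and all error numbers in the usage example are made up for illustration.

```python
def plan_promotions(tensors, base_type, max_avg_bpw):
    """Greedily promote tensors while the average bpw stays within budget.

    tensors: {name: {"n": weight_count, "cands": {type: (bpw, error)}}}
    Returns (chosen type per tensor, resulting average bpw).
    """
    choice = {name: base_type for name in tensors}
    total_n = sum(t["n"] for t in tensors.values())
    used_bits = sum(t["n"] * t["cands"][base_type][0] for t in tensors.values())
    budget_bits = max_avg_bpw * total_n
    while True:
        best = None
        for name, t in tensors.items():
            cur_bpw, cur_err = t["cands"][choice[name]]
            for typ, (bpw, err) in t["cands"].items():
                extra = t["n"] * (bpw - cur_bpw)   # added size in bits
                gain = cur_err - err               # measured error removed
                if extra <= 0 or gain <= 0 or used_bits + extra > budget_bits:
                    continue
                score = gain / extra               # stand-in for the weighted score
                if best is None or score > best[0]:
                    best = (score, name, typ, extra)
        if best is None:
            return choice, used_bits / total_n
        _, name, typ, extra = best
        choice[name] = typ
        used_bits += extra

# Hypothetical measurements: (bpw, error) per candidate type.
example = {
    "blk.0.attn_q.weight": {"n": 1000, "cands": {
        "NVFP4": (4.5, 0.010), "Q6_K": (6.5625, 0.002), "Q8_0": (8.5, 0.001)}},
    "blk.0.ffn_down.weight": {"n": 3000, "cands": {
        "NVFP4": (4.5, 0.050), "Q6_K": (6.5625, 0.006), "Q8_0": (8.5, 0.004)}},
    "output.weight": {"n": 1000, "cands": {
        "NVFP4": (4.5, 0.004), "Q6_K": (6.5625, 0.003)}},
}
plan, avg_bpw = plan_promotions(example, "NVFP4", max_avg_bpw=6.0)
```

With these made-up numbers, only the most error-prone tensor is promoted; the remaining budget is too small for any further promotion.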
Models created with this tool are posted at: https://huggingface.co/michaelw9999
This repository keeps a limited set of the original llama.cpp tools:
- llama-quantize: the primary command and preferred entry point.
- advanced-gguf-quantizer: binary for new subcommands.
- llama-imatrix: calibration imatrix generation.
- llama-perplexity: PPL and saved-logit KLD reporter.
- llama-completion: quickly tests produced GGUFs.
This tool requires the input GGUF files to be present locally. It does not download models, import Hugging Face checkpoints, or convert checkpoints.
This tool and MXFP6/E2M3 are experimental and are not supported by NVIDIA or llama.cpp. If official MXFP6 support appears later, the official format may change and GGUF models created here may not work with future runtimes.
MXFP6 is not enabled by default, but the tool can make MXFP6 quantizations and also use MXFP6 mixed with other tensor types in the model evaluation strategy.
Feedback is requested. The latest full CUDA MXFP6 branch can be installed from https://github.com/michaelw9999/llama.cpp/tree/mxfp6-cuda. The initial llama.cpp MXFP6 CPU only PR is located at ggml-org/llama.cpp#22671.
- Creates NVFP4, MXFP6, and mixed NVFP4/MXFP6 GGUFs with proper tensor and input scales.
- Uses RSF (Refined Scale Fit) for supported K-quants and NVFP4 scale fitting.
- Selects quantization types tensor-by-tensor based on set parameters/recipe or mode.
- Offers fast, normal, and deep work modes for quantizer search depth.
- Allows existing GGUFs to be easily patched to improve them further or to swap tensor types.
- Uses CUDA-accelerated improvements for tensor encoding, llama-perplexity, llama-imatrix, and in-memory runtime patch edits.
- Allows stopping and resuming while quantizing, creating an imatrix, or saving logits.
- Creates activation-aware quantization for NVFP4 using imatrix input scales.
- Offers recipe and project files for reproducible runs.
- Creates run artifacts: locked recipe, run log, tensor assignment log, manifest, checkpoint-key.json, quantization report, and validation scripts.
- Checkpoints the candidate search so interrupted long runs can resume safely.
- In-place repair and edit feature for existing GGUFs.
- p99 and p999 KLD gates in quality mode, visible in reports.
- Fused decision units for grouped tensor choices.
- Best-candidate reporting determined by defined quality and budget choices.
- A text UI and wizard for guided recipe creation, monitoring, recovery, repair, and inspection. [Still needs improvement]
- Agent-guided quantization in SKILLS.md for automated runs.

This README is still a work in progress. The tool has more features and improvements than are listed here.
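For readers unfamiliar with saved-logit KLD and the p99/p999 gates mentioned above, the following is a minimal sketch of the underlying arithmetic. It assumes the reference (BF16) and quantized logits are already loaded as NumPy arrays of shape `[n_tokens, vocab]`; it does not read the tool's KLD file format, and the function names and gate thresholds are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_kld(ref_logits, test_logits):
    """Per-token KL(P_ref || P_test) for arrays of shape [n_tokens, vocab]."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return (np.exp(lp) * (lp - lq)).sum(axis=-1)

def kld_report(kld):
    """Mean plus the tail percentiles used as quality gates."""
    return {
        "mean": float(kld.mean()),
        "p99": float(np.percentile(kld, 99.0)),
        "p999": float(np.percentile(kld, 99.9)),
    }

def passes_gates(report, max_mean, max_p99, max_p999):
    """True only if the mean and both tail percentiles stay under their limits."""
    return (report["mean"] <= max_mean
            and report["p99"] <= max_p99
            and report["p999"] <= max_p999)
```

The tail percentiles matter because a candidate can have a low mean KLD while still diverging badly on a small fraction of tokens; gating on p99 and p999 catches that case.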
Use a fresh build directory:
cmake -S . -B build -DGGML_CUDA=ON
cmake --build build -j 20

Expected binaries are in build/bin/.

Inspect the source GGUF:
./build/bin/llama-quantize inspect models/source-bf16.gguf

Create a starting recipe:
./build/bin/advanced-gguf-quantizer recipe init --profile nvfp4 --output recipes/model.toml

Edit the recipe fields for your files:
[io]
input = "models/source-bf16.gguf"
output = "runs/model/model-nvfp4.gguf"
[calibration]
corpus = "data/calibration.txt"
imatrix = "runs/model/imatrix.dat"
[evaluation]
bf16_reference = "models/source-bf16.gguf"
corpus = "data/eval.txt"
kld_base = "runs/model/bf16.kld"
[target]
precision_mode = "nvfp4"
target_bpw = 4.8
[quantizer]
mode = "normal" # fast | normal | deep

Generate an imatrix:
./build/bin/llama-imatrix \
-m models/source-bf16.gguf \
-f data/calibration.txt \
-o runs/model/imatrix.dat

Generate a saved-logit KLD base:
./build/bin/advanced-gguf-quantizer kld-command recipes/model.toml

Inspect the planned command without writing a model:
./build/bin/advanced-gguf-quantizer recipe validate recipes/model.toml
./build/bin/advanced-gguf-quantizer plan recipes/model.toml

Run real quantization:
./build/bin/advanced-gguf-quantizer run recipes/model.toml --project runs/model --yes

Inspect and test the result:
./build/bin/llama-quantize inspect runs/model/model-nvfp4.gguf
./build/bin/llama-completion -m runs/model/model-nvfp4.gguf -p "The model is"

Compact Qwen-style dense or MTP model:
./build/bin/advanced-gguf-quantizer recipe init --profile nvfp4_mxfp6 --output recipes/qwen.toml
./build/bin/advanced-gguf-quantizer run recipes/qwen.toml --project runs/qwen --yes

Quality-first MoE model:
./build/bin/advanced-gguf-quantizer recipe init --profile mxfp6-primary --output recipes/moe.toml
./build/bin/advanced-gguf-quantizer run recipes/moe.toml --project runs/moe --yes

Repair or edit an existing GGUF:
./build/bin/advanced-gguf-quantizer recipe init --profile repair --output recipes/repair.toml
./build/bin/advanced-gguf-quantizer run recipes/repair.toml --project runs/repair --yes

Use plan to inspect a recipe. For model production, launch run; it writes
real artifacts and does not waste time on fake quantization passes.

Use NVFP4 when smaller size and Blackwell speed matter most. It is the smallest
advanced mode but has more quantization error, so quality runs should rely on
imatrix, saved-logit KLD evidence, p99/p999 tails, and tensor-by-tensor
promotions for error-prone or sensitive layers.
Use MXFP6_E2M3 when quality matters more than file size. It is larger than
NVFP4 but useful for sensitive tensors, MoE experts, output/head tensors,
or models where NVFP4-only error is high. MXFP6 is still experimental and should
be selected deliberately.
Use NVFP4_MXFP6 for the best balance of speed, size, and quality. The allocator can either:
- use MXFP6 to improve an NVFP4-first model while keeping the model compact;
- use MXFP6 as the primary target and demote selected tensors to NVFP4 to increase speed and reduce model size, when measured quality loss is acceptable.
Fallback types such as MXFP6_E2M3, Q4_K, Q6_K, Q8_0, BF16, and
F16 remain available for tensor policy, size and speed-aware candidate assignment,
output/head handling, exclusions, or repair.
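To get a feel for how a mixed assignment relates to a recipe's target_bpw, the average bits per weight can be estimated from nominal per-type figures. The sketch below is an assumption-laden estimate: the K-quant and Q8_0 values follow the standard llama.cpp block sizes, while the NVFP4 (4-bit elements plus an 8-bit scale per 16 weights) and MXFP6 (6-bit elements plus an 8-bit scale per 32 weights) values follow the nominal format definitions and may not match this toolkit's exact block layout. Per-tensor scales and GGUF metadata are ignored, and the helper names are hypothetical.

```python
# Nominal bits per weight, including per-block scales (assumed values, see above).
NOMINAL_BPW = {
    "NVFP4": 4.5,
    "MXFP6_E2M3": 6.25,
    "Q4_K": 4.5,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
    "BF16": 16.0,
    "F16": 16.0,
}

def average_bpw(assignment):
    """assignment: {tensor_name: (type_name, weight_count)} -> average bpw."""
    total = sum(n for _, n in assignment.values())
    bits = sum(NOMINAL_BPW[t] * n for t, n in assignment.values())
    return bits / total

def size_gib(assignment):
    """Approximate tensor payload size in GiB for the same assignment."""
    bits = sum(NOMINAL_BPW[t] * n for t, n in assignment.values())
    return bits / 8 / 2**30

# Hypothetical split: 80% of weights in NVFP4, 20% promoted to MXFP6.
mixed = {
    "bulk": ("NVFP4", 800_000_000),
    "sensitive": ("MXFP6_E2M3", 200_000_000),
}
print(round(average_bpw(mixed), 4))
```

Under these assumptions, promoting one fifth of the weights from NVFP4 to MXFP6 raises the average from 4.5 to about 4.85 bpw, which shows why promotions have to be weighed against the size budget.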
For KLD selector runs, choose a quantizer mode:
- fast: smallest search for relatively quick iteration.
- normal: default time/search balance for typical use.
- deep: longest, deepest search for finding incremental last remaining quality