llama-quant : fail early on missing imatrix, refactor type selection, code cleanup #19770
ggerganov merged 13 commits into ggml-org:master from
Conversation
I am lucky to have had helpful feedback from @ubergarm, @AesSedai, @aldehir, @pwilkin, @compilade, and @bartowski - they all deserve a mention here. Thanks :) 🦙
Marking back as draft until things have settled down a bit.
Force-pushed from e314fa3 to decff8b
cc @ggerganov, @CISC - sorry for the messy thread, and for the re-marking as draft; there were previously a lot more changes here, but I'm going to leave some of my more aspirational improvements for future PRs. I would like to get your thoughts on the changes here, as to whether or not they are acceptable overall.
ggerganov
left a comment
Overall the refactoring goals are good. Double-check that there aren't any functional changes by comparing a few quantizations before/after and we can merge.
tools/quantize/quantize.cpp
#include "common.h"
#include "llama.h"
#include "gguf.h"
#include "../src/llama-quant.h"
Yes, though right now it's just for the struct tensor_quantization which was previously duplicated between quantize.cpp and llama-quant.cpp.
(edit: replying to ggerganov)
Including internal sources from libllama is not OK, so better to keep the duplicated structs.
Generally, passing C++ structs across the C-style API is also not OK. This API has to be refactored at some point (#12511 (comment)).
I understand, thanks. Will revert this change.
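To illustrate the constraint ggerganov describes, here is a minimal sketch of what can safely cross a C-style API boundary. All names here are hypothetical, not llama.cpp's actual API: the point is that the public header must stay valid C, so anything passed through it has to be plain-old-data (fixed layout, no `std::string`/`std::vector` members), with a pointer-plus-count in place of C++ containers.

```cpp
#include <cstdint>

extern "C" {
    // OK for a C API: plain-old-data with a stable ABI.
    struct llama_tensor_override {      // hypothetical name, for illustration only
        const char * pattern;           // tensor-name pattern, NUL-terminated
        int32_t      target_type;       // target type as an integer id
    };

    // The C entry point receives a pointer + count, never a std::vector.
    int apply_overrides(const struct llama_tensor_override * ovr, int n);
}

int apply_overrides(const struct llama_tensor_override * ovr, int n) {
    int applied = 0;
    for (int i = 0; i < n; i++) {
        // A real implementation would match patterns; here we just count valid entries.
        if (ovr[i].pattern != nullptr) {
            applied++;
        }
    }
    return applied;
}
```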
Some test results. Please let me know if there are more specific models / architectures that need to be checked.

Qwen3.5-35B-A3B bf16 -> Q8_0. Command:
I am not sure how to interpret this failing CI: https://github.com/ggml-org/llama.cpp/actions/runs/22792516852/job/66121806962#step:7:1
@ggerganov Any chance this can be merged?
it's in the preliminary loop now, so needs to be on its own line
rename `tensor_quantization` to `tensor_type_option` to describe its functionality
Should we wait for #20112 to be sure about the correctness, or do you think it's fine to merge? @bartowski1182 WDYT?
Based on the changes I'm pretty confident that it won't affect anything, and my own PR may need some iterations if I get review comments. For completeness I'll try to rebase my PR off of this fork (just locally) and run the tests myself to confirm; shouldn't take long.
There are some changes in this versus master, but one is probably worth keeping. Notably, it will get upcast to a higher level, where before it was left default.

The other issue is with the gemma-3-4b-it test results, so it seems that we're no longer properly detecting the

(ignore the bf16/f16/f32, I should probably update my script to not try to use those types since it'll be weird no matter what)
Yes, that is definitely wrong. Let me take a look...
Should be fixed in |
Looks better now with that change, now only
… code cleanup (ggml-org#19770)

* quantize : imatrix-fail early + code cleanup

* fix manual override printing

  it's in the preliminary loop now, so needs to be on its own line

* revert header changes per ggerganov

* remove old #includes

* clarify naming

  rename `tensor_quantization` to `tensor_type_option` to describe its functionality

* fix per barto
* llama-quant : correct `n_attention_wv` usage

In #19770, I introduced a regression in the way the `quantize_state_impl` counter values were initialized. I was incrementing and using `n_attention_wv` in the same loop, when it should have been fixed by the time we're deciding tensor types in `llama_tensor_get_type_impl` (for `use_more_bits`). I never observed a difference in any of [my tests](#19770 (comment)) - it was only after @bartowski kindly pointed this out that I realized it was incorrect. (Thanks!)

* simplify
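The ordering bug can be shown with a toy sketch (all names here are stand-ins; the real decision lives in `llama_tensor_get_type_impl`). A rule that depends on a total such as `n_attention_wv` must run after the count is complete; if the counter is incremented in the same loop that uses it, early tensors see a smaller total and may be decided differently:

```cpp
#include <string>
#include <vector>

struct model { std::vector<std::string> names; };

// Count attn_v tensors up front, before any decisions are made.
static int count_attn_v(const model & m) {
    int n = 0;
    for (const auto & name : m.names) {
        if (name.find("attn_v.weight") != std::string::npos) n++;
    }
    return n;
}

// Buggy order: the rule reads a *running* count, so early tensors
// see a smaller total than later ones.
static int decisions_with_running_count(const model & m) {
    int n_attention_wv = 0, upcasted = 0;
    for (const auto & name : m.names) {
        if (name.find("attn_v.weight") != std::string::npos) n_attention_wv++;
        if (n_attention_wv >= 2) upcasted++; // stand-in for a use_more_bits-style rule
    }
    return upcasted;
}

// Fixed order: finalize the count first, then decide every tensor
// against the same total.
static int decisions_with_final_count(const model & m) {
    int n_attention_wv = count_attn_v(m), upcasted = 0;
    for (const auto & name : m.names) {
        (void)name;
        if (n_attention_wv >= 2) upcasted++;
    }
    return upcasted;
}
```

With two attn_v tensors and one other tensor, the buggy order misses the first tensor because the running count is still 1 when it is examined.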
* llama-quant : correct `n_attention_wv` usage In ggml-org#19770, I introduced a regression in the way the `quantize_state_impl` counter values were initialized. I was incrementing and using `n_attention_wv` in the same loop, when it should have been fixed by the time we're deciding tensor types in `llama_tensor_get_type_impl` (for `use_more_bits`). I never observed a difference in any of [my tests](ggml-org#19770 (comment)) - it was only after @bartowski kindly pointed this out that I realized it was incorrect. (Thanks!) * simplify
Currently, if a quantization requires an importance matrix and one isn't provided, the program doesn't discover this until it reaches the offending tensor during the main quantization loop. Depending on model size and target type, this can mean wasting minutes to hours before the process aborts with a partial GGUF.
This PR adds a preliminary metadata pass over all tensors that determines target types upfront, enabling early validation. The quantization logic in `llama-quant.cpp` is refactored for clarity and correctness.

**Fail early for missing required imatrix**
A preliminary pass now computes each tensor's target quantization type before the main loop begins. If an importance matrix is required but missing, quantization fails immediately with an error identifying the offending tensor and its target type:
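As a rough sketch of the two-pass idea (hypothetical helper names, not the actual implementation): the preliminary pass only inspects metadata, so it can reject the whole job before any tensor data is read or quantized.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative stand-ins for real ggml/llama types.
enum fake_type { TYPE_Q8_0, TYPE_IQ2_XXS };

struct tensor_info {
    std::string name;     // tensor name from the model metadata
    fake_type   dst_type; // target type resolved by the preliminary pass
};

// Stand-in for the real per-tensor check in llama-quant.cpp.
static bool requires_imatrix(fake_type t) {
    return t == TYPE_IQ2_XXS; // low-bit IQ-class types need importance data
}

// Preliminary metadata pass: no tensor data is touched yet, so failure
// here costs seconds instead of hours and leaves no partial GGUF behind.
static void validate_targets(const std::vector<tensor_info> & tensors, bool have_imatrix) {
    for (const auto & t : tensors) {
        if (requires_imatrix(t.dst_type) && !have_imatrix) {
            throw std::runtime_error(
                "tensor '" + t.name + "' requires an importance matrix for its target type");
        }
    }
}
```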
The old ftype-based imatrix guard in `quantize.cpp` is removed in favor of this per-tensor check. `tensor_requires_imatrix` (renamed from `tensor_type_requires_imatrix`) now uses a `switch` on `dst_type` and correctly exempts `per_layer_token_embd.weight` in addition to `token_embd.weight`.

**Refactoring**
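The shape of that check might look like the following (assumed sketch, not the upstream code; the type enum and the set of imatrix-dependent types are illustrative):

```cpp
#include <string>

// Illustrative subset of quantization types.
enum class qtype { Q4_K, IQ1_S, IQ2_XXS };

static bool tensor_requires_imatrix(const std::string & name, qtype dst_type) {
    // Embedding tables are exempt even for imatrix-dependent target types.
    if (name == "token_embd.weight" || name == "per_layer_token_embd.weight") {
        return false;
    }
    // Switch on the destination type rather than the overall ftype,
    // so the check is accurate per tensor.
    switch (dst_type) {
        case qtype::IQ1_S:
        case qtype::IQ2_XXS:
            return true;  // these low-bit types need an importance matrix
        default:
            return false;
    }
}
```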
Extracted functions to reduce the size of `llama_model_quantize_impl`, make the logic reusable across the preliminary and main passes, and improve clarity and maintainability:

- `tensor_allows_quantization`: consolidates all "should we quantize this tensor?" checks (norm tensors, RWKV weights, conv1d, positional embeddings, etc.) previously inlined in the main loop
- `tensor_category` + `tensor_get_category`: replaces repeated `name.find(...)` calls with a single categorization pass; used by type selection and the attention-v check (`category_is_attn_v`)
- `llama_tensor_get_type` / `llama_tensor_get_type_impl`: splits type resolution into a wrapper (manual overrides, token_embedding/output overrides, fallbacks) and the core mixture/architecture logic
- `tensor_type_fallback`: extracted from the inline incompatible-shape handling block; now also handles the rare case where the fallback type itself is incompatible (falls back to F16 with a warning)
- `llama_ftype_get_default_type`: the ftype -> ggml_type switch, extracted and organized by category
- `tensor_name_match_token_embd` / `tensor_name_match_output_weight`: small helpers used across multiple call sites

**Other changes**

- `--tensor-type` patterns compiled once in the `quantize_state_impl` constructor instead of compiled per-tensor, per-flag
- `tensor_quantization` struct moved from `llama-quant.cpp` to the header (shared with `quantize.cpp` via `#include` instead of duplicated)
- `tensor_category` enum and `tensor_metadata` struct added to the header
- `has_output` replaced with `has_tied_embeddings` (clearer semantics, inverted logic, same behavior)
- `n_k_quantized` counter removed (fallback count now reported against `ml.n_tensors`)
- removed dead code (`#if 0` block)
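The fallback chain described for `tensor_type_fallback` can be sketched like this. The types and block sizes below are illustrative stand-ins, not ggml's real tables; the point is the two-step degradation, ending at F16 with a warning when even the fallback's block size does not divide the row length:

```cpp
#include <cstdio>

enum class gtype { Q4_K, Q4_0, F16 };

// Illustrative block sizes (Q4_K really uses 256-wide super-blocks,
// Q4_0 uses 32-wide blocks; F16 has no blocking constraint).
static long long block_size(gtype t) {
    switch (t) {
        case gtype::Q4_K: return 256;
        case gtype::Q4_0: return 32;
        case gtype::F16:  return 1;
    }
    return 1;
}

static gtype fallback_of(gtype t) {
    return t == gtype::Q4_K ? gtype::Q4_0 : gtype::F16;
}

static gtype tensor_type_fallback(gtype wanted, long long n_per_row) {
    if (n_per_row % block_size(wanted) == 0) {
        return wanted;                       // shape is compatible, keep the choice
    }
    gtype fb = fallback_of(wanted);
    if (n_per_row % block_size(fb) == 0) {
        return fb;                           // first fallback fits
    }
    // Rare case: the fallback is also incompatible - degrade to F16.
    std::fprintf(stderr, "warning: incompatible shape, falling back to F16\n");
    return gtype::F16;
}
```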
--tensor-typecompiled once in thequantize_state_implconstructor instead of compiled per-tensor, per-flagtensor_quantizationstruct moved fromllama-quant.cppto the header (shared withquantize.cppvia#includeinstead of duplicated)tensor_categoryenum andtensor_metadatastruct added to the headerhas_outputreplaced withhas_tied_embeddings(clearer semantics, inverted logic, same behavior)n_k_quantizedcounter removed (fallback count now reported againstml.n_tensors)#if 0block)