add --dry-run option to llama-quantize #19526
Conversation
AI disclosure: I used Claude to help with understanding which changes needed to be made, but I made all the changes by hand.
I think this is ready for review. I've tested it with gemma-3-4b (dense), granite-4.0-micro (dense hybrid), and granite-4.0-tiny-preview (hybrid MoE), and the calculated sizes are exactly right for all three (comparing …).
There are two small unrelated changes included here: …

Also tested with …
The latest commits are a minor refactor of:

```cpp
if ((new_type == GGML_TYPE_IQ2_XXS ||
     new_type == GGML_TYPE_IQ2_XS ||
     new_type == GGML_TYPE_IQ2_S ||
     new_type == GGML_TYPE_IQ1_S ||
    (new_type == GGML_TYPE_IQ1_M && strcmp(tensor->name, "token_embd.weight") && strcmp(tensor->name, "output.weight")) ||
    (new_type == GGML_TYPE_Q2_K && params->ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && strcmp(tensor->name, "token_embd.weight") != 0)) && !imatrix) { ... }
```

into a new function:

```cpp
static bool tensor_type_requires_imatrix(const llama_model_quantize_params * params, const ggml_tensor * t, const ggml_type dst_type) {
    return (
        dst_type == GGML_TYPE_IQ2_XXS || dst_type == GGML_TYPE_IQ2_XS ||
        dst_type == GGML_TYPE_IQ3_XXS || dst_type == GGML_TYPE_IQ1_S ||
        dst_type == GGML_TYPE_IQ2_S   || dst_type == GGML_TYPE_IQ1_M ||
        ( // Q2_K is the worst k-quant type - only allow it without imatrix for token embeddings
            dst_type == GGML_TYPE_Q2_K && strcmp(t->name, "token_embd.weight") != 0
        )
    );
}
```

so I can re-use it for dry-run, and added a new conditional warning, which gives a heads-up if performing this quantization will require an imatrix. This will prevent many headaches for me personally, and hopefully others :)
I'm biased because I helped suggest this feature to @ddh0 and provided a proof-of-concept, but I think this would improve quality of life for people who frequently quant models. I often receive requests for quants fitting certain size thresholds, and if you're talking about a 200B+ MoE, it takes at least twenty minutes to produce the resulting GGUF. If you happen to overshoot (or undershoot) the size quota, you have to adjust the recipe and try again, and it's honestly frustrating to have to guess-and-check. I've also run into failed quantizations many times with MTP tensors, for instance, which are usually located at the end of the model, when trying a low-bpw quant of the GLM MoEs. If you don't override the MTP tensors to a higher quantization, they don't have imatrix importance data and the quant bails out. Very frustrating experience. Overall I think these are nice features and would love to see them merged.
Gentle ping @ggerganov - any chance this can be merged?
Something fails with quantize CI test: https://github.com/ggml-org/llama.cpp/actions/runs/22198756896/job/64205914038?pr=19526
Hmm, it looks like the failing test is using this command: … and fails with: … This seems fine to me; it didn't provide an imatrix, so it's correct for this to fail. However, right after that: … No idea what this is, or why all the results at the end are blank. It seems unrelated to me, but I'm not sure. Let me know how to proceed.
If it would be easier, I can take out the … (Edit: Though, that would be kind of a mess now that I think about it. Hmm...)
Ah, that's the problem: you basically changed the check from …
OK, I'll change the check back to use the same conditions as before, since this change is not related to …
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
For the record, about that Q2_K condition:

```cpp
static bool tensor_type_requires_imatrix(const ggml_tensor * t, const ggml_type dst_type, const llama_ftype ftype) {
    return (
        dst_type == GGML_TYPE_IQ2_XXS || dst_type == GGML_TYPE_IQ2_XS ||
        dst_type == GGML_TYPE_IQ3_XXS || dst_type == GGML_TYPE_IQ1_S ||
        dst_type == GGML_TYPE_IQ2_S   || dst_type == GGML_TYPE_IQ1_M ||
        ( // Q2_K_S is the worst k-quant type - only allow it without imatrix for token embeddings
            dst_type == GGML_TYPE_Q2_K && ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && strcmp(t->name, "token_embd.weight") != 0
        )
    );
}
```

Making the per-tensor imatrix requirement conditional on the tensor itself (…)
The reason it's there is because of this: Lines 237 to 244 in b1123f9, and these: Lines 255 to 270 in b1123f9.
I see, thank you. I'll probably just leave that logic how it is, then. Merge when green? 🤞 |
Yep. :) BTW, that …
* clean slate for branch
* use 6 characters for tensor dims
* add --dry-run to llama-quantize
* use 6 characters for tensor dims (cont.)
* no need to re-calculate ggml_nbytes for tensor
* fix indent
* show model and quant BPW when quant completes
* add example to --help
* new function `tensor_requires_imatrix`, add courtesy warning about imatrix
* missing __func__, move imatrix flag set
* logic error
* fixup tensor_requires_imatrix
* add missing `GGML_TYPE`s
* simplify and rename `tensor_type_requires_imatrix`
* simplify for style
* add back Q2_K edge case for imatrix
* guard ftype imatrix warning
* comment ref ggml-org#12557
* remove per @compilade
* remove unused `params` parameter
* move `bool dry_run` per GG
* move `bool dry_run` per GG
* Update src/llama-quant.cpp (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* Update src/llama-quant.cpp (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* Update src/llama-quant.cpp (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
This PR adds a new `--dry-run` option to `llama-quantize`. This option calculates the size of each tensor in the target type without actually performing quantization, and prints the final quantization size in the same way that `llama-quantize` does currently.

Example command:
Example output:
Credit to @AesSedai for this idea - he has a preliminary version that can be seen here. His version supports calculating the size for all possible quantization types and creating a measurement file that can be re-used for any quantization. For now, this is just a simple calculation that runs on every tensor, with no fancy options.