
add --dry-run option to llama-quantize #19526

Merged
CISC merged 33 commits into ggml-org:master from ddh0:llama-quantize-dry-run
Feb 20, 2026

Conversation

@ddh0
Contributor

@ddh0 ddh0 commented Feb 11, 2026

This PR adds a new --dry-run option to llama-quantize. This option calculates the size of each tensor in the target type without actually performing quantization, and prints the final quantization size in the same way that llama-quantize does currently.

Example command:

llama-quantize --dry-run gemma-3-4b-it-q8_0.gguf Q4_K

Example output:

main: build = 8015 (07f882bbb)
main: built with AppleClang 17.0.0.17000603 for Darwin arm64
main: calculating quantization size for '/Users/dylan/Documents/AI/gguf/gemma-3-4b-it-q8_0.gguf' as Q4_K
llama_model_loader: loaded meta data with 41 key-value pairs and 444 tensors from /Users/dylan/Documents/AI/gguf/gemma-3-4b-it-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gemma-3-4b-it
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                          general.file_type u32              = 7
llama_model_loader: - kv  21:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  23:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  24:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  25:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  37:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  38:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  39:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  40:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q8_0:  239 tensors
[   1/ 444]                   output_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[   2/ 444]                    token_embd.weight - [  2560, 262208,      1,      1], type =   q8_0, size =   680.17 MiB ->   525.13 MiB (q6_K)
[   3/ 444]                  blk.0.attn_k.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     1.41 MiB (q4_K)
[   4/ 444]             blk.0.attn_k_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[   5/ 444]               blk.0.attn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[   6/ 444]             blk.0.attn_output.weight - [  2048,   2560,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[   7/ 444]                  blk.0.attn_q.weight - [  2560,   2048,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[   8/ 444]             blk.0.attn_q_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[   9/ 444]                  blk.0.attn_v.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     2.05 MiB (q6_K)
[  10/ 444]                blk.0.ffn_down.weight - [ 10240,   2560,      1,      1], type =   q8_0, size =    26.56 MiB ->    20.51 MiB (q6_K)
[  11/ 444]                blk.0.ffn_gate.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[  12/ 444]                blk.0.ffn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[  13/ 444]                  blk.0.ffn_up.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[  14/ 444]     blk.0.post_attention_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[  15/ 444]           blk.0.post_ffw_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
|
|  ... truncated for brevity ...
|
[ 432/ 444]                 blk.33.attn_k.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     1.41 MiB (q4_K)
[ 433/ 444]            blk.33.attn_k_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[ 434/ 444]              blk.33.attn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[ 435/ 444]            blk.33.attn_output.weight - [  2048,   2560,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[ 436/ 444]                 blk.33.attn_q.weight - [  2560,   2048,      1,      1], type =   q8_0, size =     5.31 MiB ->     2.81 MiB (q4_K)
[ 437/ 444]            blk.33.attn_q_norm.weight - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[ 438/ 444]                 blk.33.attn_v.weight - [  2560,   1024,      1,      1], type =   q8_0, size =     2.66 MiB ->     2.05 MiB (q6_K)
[ 439/ 444]               blk.33.ffn_down.weight - [ 10240,   2560,      1,      1], type =   q8_0, size =    26.56 MiB ->    20.51 MiB (q6_K)
[ 440/ 444]               blk.33.ffn_gate.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[ 441/ 444]               blk.33.ffn_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[ 442/ 444]                 blk.33.ffn_up.weight - [  2560,  10240,      1,      1], type =   q8_0, size =    26.56 MiB ->    14.06 MiB (q4_K)
[ 443/ 444]    blk.33.post_attention_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
[ 444/ 444]          blk.33.post_ffw_norm.weight - [  2560,      1,      1,      1], type =    f32, size =    0.010 MiB
llama_model_quantize_impl: model size  =  3932.82 MiB (8.50 BPW)
llama_model_quantize_impl: quant size  =  2368.31 MiB (5.12 BPW)

main: quantize time =   139.49 ms
main:    total time =   139.49 ms

Credit to @AesSedai for this idea - he has a preliminary version that can be seen here. His version supports calculating the size for all possible quantization types and creating a measurement file that can be re-used for any quantization. For now, this is just a simple calculation that runs on every tensor, with no fancy options.

@ddh0
Contributor Author

ddh0 commented Feb 11, 2026

AI disclosure: I used Claude to help with understanding which changes need to be made, but I did all the changes by hand.

@ddh0 ddh0 marked this pull request as ready for review February 11, 2026 21:41
@ddh0 ddh0 requested a review from ggerganov as a code owner February 11, 2026 21:41
@ddh0
Contributor Author

ddh0 commented Feb 11, 2026

I think this is ready for review. I've tested it with gemma-3-4b (dense), granite-4.0-micro (dense hybrid), and granite-4.0-tiny-preview (hybrid MoE) and the calculated sizes are exactly right for all three (comparing --dry-run to an actual quantization). Not sure if there is some edge case I'm missing.

@ddh0
Contributor Author

ddh0 commented Feb 11, 2026

There are two small unrelated changes that are included here:

  • in the output of llama_model_quantize_impl, at the end where it prints model size = ... and quant size = ..., I added BPW to each of these lines for convenience (you no longer have to load the model to see the final BPW)
  • changed the number of characters used for tensor dimensions in llama_format_tensor_shape from 5 to 6 (noticed it was not enough when testing gemma)

@ddh0
Contributor Author

ddh0 commented Feb 12, 2026

Also tested with --tensor-type overrides and it works with those, too.

@ddh0
Contributor Author

ddh0 commented Feb 12, 2026

The latest commits are a minor refactor of llama_model_quantize_impl. I moved this old check:

            if ((new_type == GGML_TYPE_IQ2_XXS ||
                 new_type == GGML_TYPE_IQ2_XS  ||
                 new_type == GGML_TYPE_IQ2_S   ||
                 new_type == GGML_TYPE_IQ1_S   ||
                (new_type == GGML_TYPE_IQ1_M && strcmp(tensor->name, "token_embd.weight") && strcmp(tensor->name, "output.weight"))  ||
                (new_type == GGML_TYPE_Q2_K && params->ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && strcmp(tensor->name, "token_embd.weight") != 0)) && !imatrix) { ... }

into a new function:

static bool tensor_type_requires_imatrix(const llama_model_quantize_params * params, const ggml_tensor * t, const ggml_type dst_type) {
    return (
        dst_type == GGML_TYPE_IQ2_XXS || dst_type == GGML_TYPE_IQ2_XS ||
        dst_type == GGML_TYPE_IQ3_XXS || dst_type == GGML_TYPE_IQ1_S  ||
        dst_type == GGML_TYPE_IQ2_S   || dst_type == GGML_TYPE_IQ1_M  ||
        (   // Q2_K is the worst k-quant type - only allow it without imatrix for token embeddings
            dst_type == GGML_TYPE_Q2_K && strcmp(t->name, "token_embd.weight") != 0
        )
    );
}

so I can re-use it for dry-run, and added a new conditional warning that gives a heads-up if performing this quantization will require an imatrix. This will prevent many headaches for me personally, and hopefully for others :)

llama_model_quantize_impl: WARNING: dry run completed successfully, but actually completing this quantization will require an imatrix!

@AesSedai
Contributor

I'm biased because I helped suggest this feature to @ddh0 and provided a proof-of-concept, but I think that this would help the quality of life for people who frequently quant models.

I often receive requests for quants fitting into certain size thresholds, and if you're talking about a 200B+ MoE, it takes at least twenty minutes to produce the resulting GGUF. If you happen to overshoot (or undershoot) the size quota, you have to adjust the recipe and try again, and it is honestly frustrating to have to guess-and-check.

I've also run into failed quantizations many times, for instance with MTP tensors (usually located at the end of the model) when trying to do a low-bpw quant of the GLM MoEs. If you don't override the MTP tensors to a higher quantization, they don't have imatrix importance data and the quant bails out. A very frustrating experience.

Overall I think these are nice features and I'd love to see them merged.

@ddh0
Contributor Author

ddh0 commented Feb 18, 2026

Gentle ping @ggerganov - any chance this can be merged?

@CISC
Collaborator

CISC commented Feb 19, 2026

@ddh0
Contributor Author

ddh0 commented Feb 19, 2026

Hmm, it looks like the test that fails is using this command:

./bin/llama-quantize ../models-mnt/qwen3/0.6B/ggml-model-bf16.gguf ../models-mnt/qwen3/0.6B/ggml-model-q2_k.gguf q2_k 8

and fails with:

2026-02-19T20:40:31.1084399Z ============================================================
2026-02-19T20:40:31.1085296Z Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
2026-02-19T20:40:31.1086200Z The result will be garbage, so bailing out
2026-02-19T20:40:31.1086832Z ============================================================
2026-02-19T20:40:31.1087256Z 
2026-02-19T20:40:31.5579790Z llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
2026-02-19T20:40:31.5581416Z main: failed to quantize model from '../models-mnt/qwen3/0.6B/ggml-model-bf16.gguf'

This seems fine to me; it didn't provide an imatrix, so it's correct for this to fail.

However right after that:

2026-02-19T20:40:31.5657945Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-ppl.log: No such file or directory
2026-02-19T20:40:31.5683643Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-imatrix-sum.log: No such file or directory
2026-02-19T20:40:31.5707597Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-f16.log: No such file or directory
2026-02-19T20:40:31.5731845Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-bf16.log: No such file or directory
2026-02-19T20:40:31.5757933Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q8_0.log: No such file or directory
2026-02-19T20:40:31.5783377Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q4_0.log: No such file or directory
2026-02-19T20:40:31.5807355Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q4_1.log: No such file or directory
2026-02-19T20:40:31.5830700Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q5_0.log: No such file or directory
2026-02-19T20:40:31.5855480Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q5_1.log: No such file or directory
2026-02-19T20:40:31.5880203Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q2_k.log: No such file or directory
2026-02-19T20:40:31.5902849Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q3_k.log: No such file or directory
2026-02-19T20:40:31.5926651Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q4_k.log: No such file or directory
2026-02-19T20:40:31.5950618Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q5_k.log: No such file or directory
2026-02-19T20:40:31.5973404Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-tg-q6_k.log: No such file or directory
2026-02-19T20:40:31.5997086Z cat: /home/ggml/results/llama.cpp/qwen3_0_6b-save-load-state.log: No such file or directory

No idea what this is, or why all the results at the end are blank. It seems unrelated to me, but I'm not sure. Let me know how to proceed.

@ddh0
Contributor Author

ddh0 commented Feb 19, 2026

If it would be easier, I can take out the tensor_type_requires_imatrix and related changes in this PR, and move all the refactoring to #19616, leaving this as only --dry-run. But still, I'm confused as to why the tests are failing.

(Edit: Though, that would be kind of a mess now that I think about it. Hmm...)

@CISC
Collaborator

CISC commented Feb 19, 2026

Hmm, it looks like the test that fails is using this command:

./bin/llama-quantize ../models-mnt/qwen3/0.6B/ggml-model-bf16.gguf ../models-mnt/qwen3/0.6B/ggml-model-q2_k.gguf q2_k 8

This seems fine to me; it didn't provide an imatrix, so it's correct for this to fail.

Ah, that's the problem, you basically changed the check from new_type == GGML_TYPE_Q2_K && params->ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S to just new_type == GGML_TYPE_Q2_K, and now it fails.

@ddh0
Contributor Author

ddh0 commented Feb 19, 2026

OK, I'll change the check back to use the same conditions as before, since this change is not related to --dry-run anyway. Will leave those changes to the next PR.

ddh0 and others added 3 commits February 19, 2026 16:03
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@ddh0
Contributor Author

ddh0 commented Feb 19, 2026

For the record, about that Q2_K condition:

static bool tensor_type_requires_imatrix(const ggml_tensor * t, const ggml_type dst_type, const llama_ftype ftype) {
    return (
        dst_type == GGML_TYPE_IQ2_XXS || dst_type == GGML_TYPE_IQ2_XS ||
        dst_type == GGML_TYPE_IQ3_XXS || dst_type == GGML_TYPE_IQ1_S  ||
        dst_type == GGML_TYPE_IQ2_S   || dst_type == GGML_TYPE_IQ1_M  ||
        (   // Q2_K_S is the worst k-quant type - only allow it without imatrix for token embeddings
            dst_type == GGML_TYPE_Q2_K && ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && strcmp(t->name, "token_embd.weight") != 0
        )
    );
}

Making the per-tensor imatrix requirement conditional on the tensor itself (t->type and t->name) makes sense, but I think that making it conditional on the overall ftype (Q2_K_S) does not, and I will attempt to remove it cleanly in the refactor.

@CISC
Collaborator

CISC commented Feb 19, 2026

Making the per-tensor imatrix requirement conditional on the tensor itself (t->type and t->name) makes sense, but I think that making it conditional on the overall ftype (Q2_K_S) does not, and I will attempt to remove it cleanly in the refactor.

The reason it's there is because of this:

} else if (name == "token_embd.weight" || name == "per_layer_token_embd.weight") {
    if (qs.params->token_embedding_type < GGML_TYPE_COUNT) {
        new_type = qs.params->token_embedding_type;
    } else {
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
            ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
            new_type = GGML_TYPE_Q2_K;
        }

and these:
} else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
           ftype == LLAMA_FTYPE_MOSTLY_IQ2_S  || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
    if (name.find("attn_v.weight") != std::string::npos) {
        if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
        else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
        ++qs.i_attention_wv;
    }
    else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q4_K;
    }
    else if (name.find("ffn_down") != std::string::npos) {
        if (qs.i_ffn_down < qs.n_ffn_down/8) {
            new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
        }
        ++qs.i_ffn_down;
    }

@ddh0
Contributor Author

ddh0 commented Feb 19, 2026

I see, thank you. I'll probably just leave that logic how it is, then. Merge when green? 🤞

@CISC
Collaborator

CISC commented Feb 19, 2026

I see, thank you. I'll probably just leave that logic how it is, then. Merge when green? 🤞

Yep. :)

BTW, that per_layer_token_embd.weight was probably added some time later. Means that check should be amended, but another PR...

@CISC CISC merged commit 492bc31 into ggml-org:master Feb 20, 2026
78 checks passed
@ddh0 ddh0 deleted the llama-quantize-dry-run branch February 20, 2026 18:15
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* clean slate for branch

* use 6 characters for tensor dims

* add --dry-run to llama-quantize

* use 6 characters for tensor dims (cont.)

* no need to re-calculate ggml_nbytes for tensor

* fix indent

* show model and quant BPW when quant completes

* add example to --help

* new function `tensor_requires_imatrix`, add courtesy warning about imatrix

* missing __func__, move imatrix flag set

* logic error

* fixup tensor_requires_imatrix

* add missing `GGML_TYPE`s

* simplify and rename `tensor_type_requires_imatrix`

* simplify for style

* add back Q2_K edge case for imatrix

* guard ftype imatrix warning

* comment ref ggml-org#12557

* remove per @compilade

* remove unused `params` parameter

* move `bool dry_run` per GG

* move `bool dry_run` per GG

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
(same squashed commit message as above)
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
(same squashed commit message as above)