
Add unit test coverage for llama_tensor_get_type #20112

Draft
bartowski1182 wants to merge 16 commits into ggml-org:master from bartowski1182:llama-quant-refactor

Conversation

@bartowski1182
Contributor

@bartowski1182 bartowski1182 commented Mar 4, 2026

This is part of a larger goal of reworking or replacing the llama_tensor_get_type function.

Before major work starts in that area, I want to capture the current behaviour thoroughly, so that any accidental changes are easy to spot and any purposeful changes are easy to document.

To that end, this PR introduces unit test coverage for the function itself.

Using a preset list of models and the new gguf-model-data utility, these tests pull real model metadata directly from Hugging Face, create mock models/tensors, run them through the llama_tensor_get_type function, and record the resulting schema in the tests/snapshots/ directory as model-name.schema.

I hope the storage method I've settled on won't be overly burdensome. I iterated through various ways to store the existing tensor layouts, and I think this landed on an acceptable compromise: not too many files, and "only" 700 KB of storage.

If this is not acceptable, I can go back to the drawing board.
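For intuition, the comparison each test run performs can be sketched as follows. This is an illustrative reduction, not the actual test-quant-type-selection code; mock_get_type, render_schema, and snapshot_matches are hypothetical names, and the real function also considers architecture, ftype, and layer counters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for llama_tensor_get_type: maps a tensor name to a type string.
static std::string mock_get_type(const std::string & name) {
    if (name.find("attn_v") != std::string::npos) {
        return "q6_K";
    }
    return "q4_K";
}

// Render the current behaviour as schema lines, one "name=type" per tensor.
std::vector<std::string> render_schema(const std::vector<std::string> & tensors) {
    std::vector<std::string> lines;
    for (const auto & t : tensors) {
        lines.push_back(t + "=" + mock_get_type(t));
    }
    return lines;
}

// A run passes when freshly rendered lines match the stored snapshot exactly.
bool snapshot_matches(const std::vector<std::string> & fresh,
                      const std::vector<std::string> & stored) {
    return fresh == stored;
}
```

In this reduction, --generate corresponds to writing render_schema's output to disk, and a normal run corresponds to snapshot_matches against the stored file.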

To capture the current tensor layouts, the script can be run with the --generate flag, which will download the metadata and produce the schema files:

`--generate output`
./build/bin/test-quant-type-selection --generate
This will overwrite all snapshot files in:
  /home/colin/git_repos/forks/quant-refactor/tests/snapshots
Continue? [y/N] y
Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//ggml-org_Qwen3-0.6B-GGUF--Qwen3-0.6B-Q8_0.gguf.partial
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3-0.6b.schema
Fetching model metadata for GLM-4.6V from ggml-org/GLM-4.6V-GGUF...
[truncated]
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-397b-a17b.schema
Fetching model metadata for Qwen3.5-27B from bartowski/Qwen_Qwen3.5-27B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//bartowski_Qwen_Qwen3.5-27B-GGUF--Qwen_Qwen3.5-27B-Q8_0.gguf.partial
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-27b.schema
12 files written

Then you can run the test itself without any arguments:

`test output`
./build/bin/test-quant-type-selection
=== Qwen3-0.6B ===
Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//ggml-org_Qwen3-0.6B-GGUF--Qwen3-0.6B-Q8_0.gguf.partial
  PASS  Qwen3-0.6B: 33/33 ftype sections passed (198 tensors)
[truncated]
=== Qwen3.5-27B ===
Fetching model metadata for Qwen3.5-27B from bartowski/Qwen_Qwen3.5-27B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//bartowski_Qwen_Qwen3.5-27B-GGUF--Qwen_Qwen3.5-27B-Q8_0.gguf.partial
  PASS  Qwen3.5-27B: 33/33 ftype sections passed (498 tensors)

12/12 models passed

There are a few changes to llama-quant.cpp. Notably, the function itself is now non-static so the tests can access it, and init_quantize_state_counters and llama_ftype_get_default_type are extracted so they can be reused without being recreated. Lastly, I pulled in the change from @ddh0 in their PR #19770 that adds tensor_allows_quantization (except I made it non-static), since I needed a function like that; once that PR is merged (and this is rebased on latest), that change will go away.

I also moved a couple of things, such as quantize_state_impl, to the header file for similar reasons.

Finally, I found an issue with gguf-model-data's gguf_read_uint32_val when the per-layer head counts are stored as an array, as with Step-3.5-Flash, so I've added a fix for that and a unit test to capture the behaviour.
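The bug class is easy to illustrate: a GGUF key such as a head count can hold either a single uint32 or a per-layer array, and a reader that assumes a scalar mishandles the array case. Below is a minimal sketch of a shape-aware lookup; gguf_uint32_val and head_count_for_layer are hypothetical names standing in for the real gguf-model-data representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a GGUF uint32 value that may be scalar or array.
struct gguf_uint32_val {
    bool                  is_array;
    uint32_t              scalar;
    std::vector<uint32_t> arr;
};

// Per-layer lookup: a scalar key applies to every layer, an array key is
// indexed by layer; .at() throws instead of silently reading out of range.
uint32_t head_count_for_layer(const gguf_uint32_val & v, size_t layer) {
    if (!v.is_array) {
        return v.scalar;
    }
    return v.arr.at(layer);
}
```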

@github-actions github-actions bot added the testing Everything test related label Mar 4, 2026
@bartowski1182
Contributor Author

bartowski1182 commented Mar 4, 2026

Regarding the +23,000 lines and ~700 KB from the .schema files: I wouldn't be opposed to instead suggesting that maintainers run the script on mainline and then again on their changes, so the files don't have to exist in the repo itself. That would certainly declutter.

Comment on lines +1 to +3
#include "../src/llama-arch.h"
#include "../src/llama-model.h"
#include "../src/llama-quant.h"
Contributor Author

@ggerganov

I know you mentioned to @ddh0 that these kinds of includes should be avoided by duplicating the structs.

Figured I should check: is my case a special one because it's being done in tests?

If not, I'm willing to back-burner this and look into the changes to the API you wanted first.

Member

Tests are OK to include internal files. It's problematic for tools and examples because 3rd parties that use libllama will not be able to do that (include internal files); they only work with the public API.

@ggerganov
Member

How long does it currently take to generate the snapshots?

@bartowski1182
Contributor Author

bartowski1182 commented Mar 9, 2026

16 seconds if the models aren't cached, 3 seconds if they are.

timed
time ./build/bin/test-quant-type-selection --generate
This will overwrite all snapshot files in:
  /home/colin/git_repos/forks/quant-refactor/tests/snapshots
Continue? [y/N] y

Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: downloading 2097152 bytes from Qwen3-0.6B-Q8_0.gguf
...
gguf_fetch: downloading 16777216 bytes from Qwen_Qwen3.5-27B-Q8_0.gguf
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-27b.schema
12 files written

real    0m16.441s
user    0m5.363s
sys     0m0.713s
time ./build/bin/test-quant-type-selection --generate
This will overwrite all snapshot files in:
  /home/colin/git_repos/forks/quant-refactor/tests/snapshots
Continue? [y/N] y

Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//ggml-org_Qwen3-0.6B-GGUF--Qwen3-0.6B-Q8_0.gguf.partial
...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//bartowski_Qwen_Qwen3.5-27B-GGUF--Qwen_Qwen3.5-27B-Q8_0.gguf.partial
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-27b.schema
12 files written

real    0m3.010s
user    0m0.973s
sys     0m0.098s

@github-actions github-actions bot added documentation Improvements or additions to documentation model Model specific script Script related Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend examples python python script changes devops improvements to build systems and github actions server ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend jinja parser Issues related to the jinja parser labels Mar 10, 2026
Contributor Author

@bartowski1182 bartowski1182 left a comment


Adding comments to justify changes to llama-quant.cpp


// result of parsing --tensor-type option
// (changes to this struct must be reflected in tools/quantize/quantize.cpp)
struct tensor_type_option {
Contributor Author

Moved to llama-quant.h

@@ -1,25 +1,18 @@
#include "llama.h"
Contributor Author

Moved to llama-quant.h

#include <algorithm>
#include <cmath>
#include <cstring>
#include <string>
Contributor Author

Moved to llama-quant.h

// quantization state
//

struct quantize_state_impl {
Contributor Author

Moved to llama-quant.h

//

static bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
Contributor Author

Made public so I can use it in my tests

// outer wrapper: determine the ggml_type that this tensor should be quantized to
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, const llama_model_quantize_params * params, const ggml_tensor * tensor, ggml_type default_type, const tensor_metadata & tm) {
// public API: compute category from tensor name and delegate to _impl
ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
Contributor Author

Swapped to an internal/public API so that I can avoid the need for ggml_get_name and tensor_get_category
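The internal/public split follows a common pattern: a thin wrapper derives what it needs from the tensor name and delegates to an _impl that takes plain values, which is easy to drive from tests. A simplified sketch with illustrative names, categories, and return values (not the PR's actual signatures):

```cpp
#include <cassert>
#include <string>

enum tensor_category { CAT_ATTN, CAT_FFN, CAT_OTHER };

// Core logic operates on precomputed plain values only, so a test can call
// it directly without constructing real ggml tensors.
int get_type_impl(tensor_category cat) {
    switch (cat) {
        case CAT_ATTN: return 1; // placeholder "type" values
        case CAT_FFN:  return 2;
        default:       return 0;
    }
}

// Thin public wrapper: derive the category from the tensor name, delegate.
int get_type(const std::string & name) {
    tensor_category cat = CAT_OTHER;
    if (name.find("attn") != std::string::npos) {
        cat = CAT_ATTN;
    } else if (name.find("ffn") != std::string::npos) {
        cat = CAT_FFN;
    }
    return get_type_impl(cat);
}
```

The test suite can then exercise get_type_impl exhaustively over categories while the wrapper stays a trivial name-parsing shim.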

//

static ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
Contributor Author

Made public so it can be tested

case LLAMA_FTYPE_MOSTLY_IQ3_S:
case LLAMA_FTYPE_MOSTLY_IQ3_M: return GGML_TYPE_IQ3_S;

default: throw std::runtime_error(format("invalid output file type %d\n", ftype));
Contributor Author

This was causing annoyance when trying to iterate through all ftypes one by one, so I removed the throw from here and added it back at the call site below.

I could instead iterate in a way that avoids the missing middle quants (Q4_0_4_4, etc.) and revert this.

return nullptr;
}

void init_quantize_state_counters(quantize_state_impl & qs, const std::vector<std::string> & tensor_names) {
Contributor Author

Extracted the initialization of the state counters so I can use them outside the main function.


default_type = llama_ftype_get_default_type(ftype);
ggml_type default_type = llama_ftype_get_default_type(ftype);
if (default_type == GGML_TYPE_COUNT) {
Contributor Author

(this is where I added the throw back)
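The resulting pattern: the lookup returns the GGML_TYPE_COUNT sentinel for unknown ftypes, and only the production call site turns that into an error, so a test can sweep every ftype value without wrapping each call in a try/catch. A minimal sketch with stand-in enums (not the real GGML/LLAMA values):

```cpp
#include <cassert>
#include <stdexcept>

enum fake_type  { TYPE_Q4 = 0, TYPE_Q8 = 1, TYPE_COUNT = 2 }; // sentinel last
enum fake_ftype { FTYPE_Q4 = 0, FTYPE_Q8 = 1, FTYPE_BAD = 99 };

// The lookup never throws: unknown ftypes map to the TYPE_COUNT sentinel,
// so tests can iterate over arbitrary ftype values safely.
fake_type default_type_for(fake_ftype f) {
    switch (f) {
        case FTYPE_Q4: return TYPE_Q4;
        case FTYPE_Q8: return TYPE_Q8;
        default:       return TYPE_COUNT;
    }
}

// The production call site is the one place that turns the sentinel into an
// error, preserving the old user-facing behaviour.
fake_type default_type_or_throw(fake_ftype f) {
    fake_type t = default_type_for(f);
    if (t == TYPE_COUNT) {
        throw std::runtime_error("invalid output file type");
    }
    return t;
}
```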

Comment on lines +758 to +803
struct ftype_name_entry {
const char * name;
llama_ftype ftype;
};

static const ftype_name_entry ftype_name_table[] = {
{ "F32", LLAMA_FTYPE_ALL_F32 },
{ "F16", LLAMA_FTYPE_MOSTLY_F16 },
{ "BF16", LLAMA_FTYPE_MOSTLY_BF16 },
{ "Q4_0", LLAMA_FTYPE_MOSTLY_Q4_0 },
{ "Q4_1", LLAMA_FTYPE_MOSTLY_Q4_1 },
{ "Q5_0", LLAMA_FTYPE_MOSTLY_Q5_0 },
{ "Q5_1", LLAMA_FTYPE_MOSTLY_Q5_1 },
{ "Q8_0", LLAMA_FTYPE_MOSTLY_Q8_0 },
{ "Q2_K", LLAMA_FTYPE_MOSTLY_Q2_K },
{ "Q2_K_S", LLAMA_FTYPE_MOSTLY_Q2_K_S },
{ "Q3_K_S", LLAMA_FTYPE_MOSTLY_Q3_K_S },
{ "Q3_K_M", LLAMA_FTYPE_MOSTLY_Q3_K_M },
{ "Q3_K_L", LLAMA_FTYPE_MOSTLY_Q3_K_L },
{ "Q4_K_S", LLAMA_FTYPE_MOSTLY_Q4_K_S },
{ "Q4_K_M", LLAMA_FTYPE_MOSTLY_Q4_K_M },
{ "Q5_K_S", LLAMA_FTYPE_MOSTLY_Q5_K_S },
{ "Q5_K_M", LLAMA_FTYPE_MOSTLY_Q5_K_M },
{ "Q6_K", LLAMA_FTYPE_MOSTLY_Q6_K },
{ "IQ1_S", LLAMA_FTYPE_MOSTLY_IQ1_S },
{ "IQ1_M", LLAMA_FTYPE_MOSTLY_IQ1_M },
{ "IQ2_XXS", LLAMA_FTYPE_MOSTLY_IQ2_XXS },
{ "IQ2_XS", LLAMA_FTYPE_MOSTLY_IQ2_XS },
{ "IQ2_S", LLAMA_FTYPE_MOSTLY_IQ2_S },
{ "IQ2_M", LLAMA_FTYPE_MOSTLY_IQ2_M },
{ "IQ3_XXS", LLAMA_FTYPE_MOSTLY_IQ3_XXS },
{ "IQ3_XS", LLAMA_FTYPE_MOSTLY_IQ3_XS },
{ "IQ3_S", LLAMA_FTYPE_MOSTLY_IQ3_S },
{ "IQ3_M", LLAMA_FTYPE_MOSTLY_IQ3_M },
{ "IQ4_NL", LLAMA_FTYPE_MOSTLY_IQ4_NL },
{ "IQ4_XS", LLAMA_FTYPE_MOSTLY_IQ4_XS },
{ "TQ1_0", LLAMA_FTYPE_MOSTLY_TQ1_0 },
{ "TQ2_0", LLAMA_FTYPE_MOSTLY_TQ2_0 },
{ "MXFP4_MOE", LLAMA_FTYPE_MOSTLY_MXFP4_MOE },
};

llama_ftype llama_ftype_from_name(const char * name) {
for (const auto & e : ftype_name_table) {
if (strcmp(name, e.name) == 0) {
return e.ftype;
}
Contributor

Can this logic reasonably be deduplicated with respect to static std::string llama_model_ftype_name from src/llama-model-loader.cpp?

If not, can it be made identical?

return (llama_ftype)-1;
}

const char * llama_ftype_to_name(llama_ftype ftype) {
Contributor

(same as the above?)
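One way to address the duplication raised here is to drive both directions from a single table, so the name-to-ftype and ftype-to-name mappings can never drift apart. A sketch with a truncated, illustrative table (the numeric values are placeholders, not the real LLAMA_FTYPE constants):

```cpp
#include <cassert>
#include <cstring>

// Single source of truth: both lookup directions walk this one table.
struct entry { const char * name; int ftype; };

static const entry ftype_table[] = {
    { "Q4_K_M", 15 },
    { "Q8_0",   7  },
};

int ftype_from_name(const char * name) {
    for (const auto & e : ftype_table) {
        if (std::strcmp(name, e.name) == 0) {
            return e.ftype;
        }
    }
    return -1; // unknown name
}

const char * name_from_ftype(int ftype) {
    for (const auto & e : ftype_table) {
        if (e.ftype == ftype) {
            return e.name;
        }
    }
    return nullptr; // unknown ftype
}
```

Since both functions read the same rows, adding a new quant type is a one-line change and the two directions stay identical by construction.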

// compute tensor metadata once and cache it
std::vector<tensor_metadata> metadata(tensors.size());
for (size_t i = 0; i < tensors.size(); ++i) {
metadata[i].name = ggml_get_name(tensors[i]->tensor);
Contributor

Suggested change
metadata[i].name = ggml_get_name(tensors[i]->tensor);
metadata[i].name = tensors[i]->tensor->name;

Nitpick, not sure if it matters, but calling ggml_get_name is generally unnecessary where we can just do tensor->name.
