
Add unit test coverage for llama_tensor_get_type #20112

Draft
bartowski1182 wants to merge 16 commits into ggml-org:master from bartowski1182:llama-quant-refactor

Conversation

@bartowski1182
Contributor

@bartowski1182 bartowski1182 commented Mar 4, 2026

This is part of a larger goal of reworking or replacing the llama_tensor_get_type function.

Before major work starts in that area, I want to capture the current behaviour thoroughly, so that any accidental changes are easy to spot and any purposeful changes are easy to document.

To that end, this PR introduces unit test coverage for the function itself.

Using a preset list of models and the new gguf-model-data utility, these tests pull real model metadata directly from Hugging Face, create mock models/tensors, run them through the llama_tensor_get_type function, and record the resulting schema in the tests/snapshots/ directory as model-name.schema.

I hope the storage method I've settled on won't be overly burdensome. I iterated through various ways to store the existing tensor layouts, and I think this landed on an acceptable compromise: not too many files, and "only" 700 KB of storage.

If this is not acceptable, I can go back to the drawing board.
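For intuition, the comparison each test run performs can be sketched as follows. This is an illustrative reduction, not the actual test-quant-type-selection code; mock_get_type, render_schema, and snapshot_matches are hypothetical names, and the real function also considers architecture, ftype, and layer counters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for llama_tensor_get_type: maps a tensor name to a type string.
static std::string mock_get_type(const std::string & name) {
    if (name.find("attn_v") != std::string::npos) {
        return "q6_K";
    }
    return "q4_K";
}

// Render the current behaviour as schema lines, one "name=type" per tensor.
std::vector<std::string> render_schema(const std::vector<std::string> & tensors) {
    std::vector<std::string> lines;
    for (const auto & t : tensors) {
        lines.push_back(t + "=" + mock_get_type(t));
    }
    return lines;
}

// A run passes when freshly rendered lines match the stored snapshot exactly.
bool snapshot_matches(const std::vector<std::string> & fresh,
                      const std::vector<std::string> & stored) {
    return fresh == stored;
}
```

In this reduction, --generate corresponds to writing render_schema's output to disk, and a normal run corresponds to snapshot_matches against the stored file.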

To capture the current tensor layouts, the script can be run with the --generate flag, which will download the metadata and produce the schema files:

`--generate output`
./build/bin/test-quant-type-selection --generate
This will overwrite all snapshot files in:
  /home/colin/git_repos/forks/quant-refactor/tests/snapshots
Continue? [y/N] y
Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//ggml-org_Qwen3-0.6B-GGUF--Qwen3-0.6B-Q8_0.gguf.partial
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3-0.6b.schema
Fetching model metadata for GLM-4.6V from ggml-org/GLM-4.6V-GGUF...
[truncated]
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-397b-a17b.schema
Fetching model metadata for Qwen3.5-27B from bartowski/Qwen_Qwen3.5-27B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//bartowski_Qwen_Qwen3.5-27B-GGUF--Qwen_Qwen3.5-27B-Q8_0.gguf.partial
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-27b.schema
12 files written

Then you can run the test itself without any arguments:

`test output`
./build/bin/test-quant-type-selection
=== Qwen3-0.6B ===
Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//ggml-org_Qwen3-0.6B-GGUF--Qwen3-0.6B-Q8_0.gguf.partial
  PASS  Qwen3-0.6B: 33/33 ftype sections passed (198 tensors)
[truncated]
=== Qwen3.5-27B ===
Fetching model metadata for Qwen3.5-27B from bartowski/Qwen_Qwen3.5-27B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//bartowski_Qwen_Qwen3.5-27B-GGUF--Qwen_Qwen3.5-27B-Q8_0.gguf.partial
  PASS  Qwen3.5-27B: 33/33 ftype sections passed (498 tensors)

12/12 models passed

There are a few changes to llama-quant.cpp. Notably, the function itself is now non-static so the tests can access it, and init_quantize_state_counters and llama_ftype_get_default_type are extracted so they can be reused without being recreated. Lastly, I pulled in the change from @ddh0 in their PR #19770 that adds tensor_allows_quantization (except I made it non-static), since I needed a function like that; once that PR is merged (and this is rebased on latest), that change will go away.

I also moved a couple of things, such as quantize_state_impl, to the header file for similar reasons.

Finally, I found an issue with gguf-model-data's gguf_read_uint32_val when the per-layer head counts are stored as an array, as with Step-3.5-Flash, so I've added a fix for that and a unit test to capture the behaviour.
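The bug class is easy to illustrate: a GGUF key such as a head count can hold either a single uint32 or a per-layer array, and a reader that assumes a scalar mishandles the array case. Below is a minimal sketch of a shape-aware lookup; gguf_uint32_val and head_count_for_layer are hypothetical names standing in for the real gguf-model-data representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a GGUF uint32 value that may be scalar or array.
struct gguf_uint32_val {
    bool                  is_array;
    uint32_t              scalar;
    std::vector<uint32_t> arr;
};

// Per-layer lookup: a scalar key applies to every layer, an array key is
// indexed by layer; .at() throws instead of silently reading out of range.
uint32_t head_count_for_layer(const gguf_uint32_val & v, size_t layer) {
    if (!v.is_array) {
        return v.scalar;
    }
    return v.arr.at(layer);
}
```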

@github-actions github-actions bot added the testing Everything test related label Mar 4, 2026
@bartowski1182
Contributor Author

bartowski1182 commented Mar 4, 2026

Regarding the +23,000 lines and ~700 KB from the .schema files: I wouldn't be opposed to instead suggesting that maintainers run the script on mainline and then again on their changes, so the files don't have to exist in the repo itself. That would certainly declutter.

Comment on lines +1 to +3
#include "../src/llama-arch.h"
#include "../src/llama-model.h"
#include "../src/llama-quant.h"
Contributor Author

@ggerganov

I know you mentioned to @ddh0 that these kinds of includes should be avoided by duplicating the structs.

Figured I should check: is my case a special one because it's being done in tests?

If not, I'm willing to back-burner this and look into the changes to the API you wanted first.

Member

Tests are OK to include internal files. It's problematic for tools and examples because 3rd parties that use libllama will not be able to do that (include internal files); they only work with the public API.

@ggerganov
Member

How long does it currently take to generate the snapshots?

@bartowski1182
Contributor Author

bartowski1182 commented Mar 9, 2026

16 seconds if the models aren't cached, 3 seconds if they are.

timed
time ./build/bin/test-quant-type-selection --generate
This will overwrite all snapshot files in:
  /home/colin/git_repos/forks/quant-refactor/tests/snapshots
Continue? [y/N] y

Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: downloading 2097152 bytes from Qwen3-0.6B-Q8_0.gguf
...
gguf_fetch: downloading 16777216 bytes from Qwen_Qwen3.5-27B-Q8_0.gguf
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-27b.schema
12 files written

real    0m16.441s
user    0m5.363s
sys     0m0.713s
time ./build/bin/test-quant-type-selection --generate
This will overwrite all snapshot files in:
  /home/colin/git_repos/forks/quant-refactor/tests/snapshots
Continue? [y/N] y

Fetching model metadata for Qwen3-0.6B from ggml-org/Qwen3-0.6B-GGUF...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//ggml-org_Qwen3-0.6B-GGUF--Qwen3-0.6B-Q8_0.gguf.partial
...
gguf_fetch: loaded from cache: /home/colin/.cache/llama.cpp/gguf-headers//bartowski_Qwen_Qwen3.5-27B-GGUF--Qwen_Qwen3.5-27B-Q8_0.gguf.partial
  wrote /home/colin/git_repos/forks/quant-refactor/tests/snapshots/qwen3.5-27b.schema
12 files written

real    0m3.010s
user    0m0.973s
sys     0m0.098s

@github-actions github-actions bot added documentation Improvements or additions to documentation model Model specific script Script related Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend examples python python script changes devops improvements to build systems and github actions server ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend jinja parser Issues related to the jinja parser labels Mar 10, 2026
Contributor Author

@bartowski1182 bartowski1182 left a comment


Adding comments to justify changes to llama-quant.cpp


// result of parsing --tensor-type option
// (changes to this struct must be reflected in tools/quantize/quantize.cpp)
struct tensor_type_option {
Contributor Author

Moved to llama-quant.h

@@ -1,25 +1,18 @@
#include "llama.h"
Contributor Author

Moved to llama-quant.h

#include <algorithm>
#include <cmath>
#include <cstring>
#include <string>
Contributor Author

Moved to llama-quant.h

// quantization state
//

struct quantize_state_impl {
Contributor Author

Moved to llama-quant.h

//

static bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
Contributor Author

Made public so I can use it in my tests

// outer wrapper: determine the ggml_type that this tensor should be quantized to
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, const llama_model_quantize_params * params, const ggml_tensor * tensor, ggml_type default_type, const tensor_metadata & tm) {
// public API: compute category from tensor name and delegate to _impl
ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
Contributor Author

Swapped to an internal/public API so that I can avoid the need for ggml_get_name and tensor_get_category
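The internal/public split follows a common pattern: a thin wrapper derives what it needs from the tensor name and delegates to an _impl that takes plain values, which is easy to drive from tests. A simplified sketch with illustrative names, categories, and return values (not the PR's actual signatures):

```cpp
#include <cassert>
#include <string>

enum tensor_category { CAT_ATTN, CAT_FFN, CAT_OTHER };

// Core logic operates on precomputed plain values only, so a test can call
// it directly without constructing real ggml tensors.
int get_type_impl(tensor_category cat) {
    switch (cat) {
        case CAT_ATTN: return 1; // placeholder "type" values
        case CAT_FFN:  return 2;
        default:       return 0;
    }
}

// Thin public wrapper: derive the category from the tensor name, delegate.
int get_type(const std::string & name) {
    tensor_category cat = CAT_OTHER;
    if (name.find("attn") != std::string::npos) {
        cat = CAT_ATTN;
    } else if (name.find("ffn") != std::string::npos) {
        cat = CAT_FFN;
    }
    return get_type_impl(cat);
}
```

The test suite can then exercise get_type_impl exhaustively over categories while the wrapper stays a trivial name-parsing shim.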

//

static ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
Contributor Author

Made public so it can be tested

case LLAMA_FTYPE_MOSTLY_IQ3_S:
case LLAMA_FTYPE_MOSTLY_IQ3_M: return GGML_TYPE_IQ3_S;

default: throw std::runtime_error(format("invalid output file type %d\n", ftype));
Contributor Author

This was causing annoyance when trying to iterate through all ftypes one by one, so I removed the throw from here and added it back at the call site below.

I could instead iterate in a way that avoids the missing middle quants (Q4_0_4_4, etc.) and revert this.

return nullptr;
}

void init_quantize_state_counters(quantize_state_impl & qs, const std::vector<std::string> & tensor_names) {
Contributor Author

Extracted the initialization of the state counters so I can use them outside the main function.


default_type = llama_ftype_get_default_type(ftype);
ggml_type default_type = llama_ftype_get_default_type(ftype);
if (default_type == GGML_TYPE_COUNT) {
Contributor Author

(this is where I added the throw back)
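The resulting pattern: the lookup returns the GGML_TYPE_COUNT sentinel for unknown ftypes, and only the production call site turns that into an error, so a test can sweep every ftype value without wrapping each call in a try/catch. A minimal sketch with stand-in enums (not the real GGML/LLAMA values):

```cpp
#include <cassert>
#include <stdexcept>

enum fake_type  { TYPE_Q4 = 0, TYPE_Q8 = 1, TYPE_COUNT = 2 }; // sentinel last
enum fake_ftype { FTYPE_Q4 = 0, FTYPE_Q8 = 1, FTYPE_BAD = 99 };

// The lookup never throws: unknown ftypes map to the TYPE_COUNT sentinel,
// so tests can iterate over arbitrary ftype values safely.
fake_type default_type_for(fake_ftype f) {
    switch (f) {
        case FTYPE_Q4: return TYPE_Q4;
        case FTYPE_Q8: return TYPE_Q8;
        default:       return TYPE_COUNT;
    }
}

// The production call site is the one place that turns the sentinel into an
// error, preserving the old user-facing behaviour.
fake_type default_type_or_throw(fake_ftype f) {
    fake_type t = default_type_for(f);
    if (t == TYPE_COUNT) {
        throw std::runtime_error("invalid output file type");
    }
    return t;
}
```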

Comment on lines +758 to +803
struct ftype_name_entry {
const char * name;
llama_ftype ftype;
};

static const ftype_name_entry ftype_name_table[] = {
{ "F32", LLAMA_FTYPE_ALL_F32 },
{ "F16", LLAMA_FTYPE_MOSTLY_F16 },
{ "BF16", LLAMA_FTYPE_MOSTLY_BF16 },
{ "Q4_0", LLAMA_FTYPE_MOSTLY_Q4_0 },
{ "Q4_1", LLAMA_FTYPE_MOSTLY_Q4_1 },
{ "Q5_0", LLAMA_FTYPE_MOSTLY_Q5_0 },
{ "Q5_1", LLAMA_FTYPE_MOSTLY_Q5_1 },
{ "Q8_0", LLAMA_FTYPE_MOSTLY_Q8_0 },
{ "Q2_K", LLAMA_FTYPE_MOSTLY_Q2_K },
{ "Q2_K_S", LLAMA_FTYPE_MOSTLY_Q2_K_S },
{ "Q3_K_S", LLAMA_FTYPE_MOSTLY_Q3_K_S },
{ "Q3_K_M", LLAMA_FTYPE_MOSTLY_Q3_K_M },
{ "Q3_K_L", LLAMA_FTYPE_MOSTLY_Q3_K_L },
{ "Q4_K_S", LLAMA_FTYPE_MOSTLY_Q4_K_S },
{ "Q4_K_M", LLAMA_FTYPE_MOSTLY_Q4_K_M },
{ "Q5_K_S", LLAMA_FTYPE_MOSTLY_Q5_K_S },
{ "Q5_K_M", LLAMA_FTYPE_MOSTLY_Q5_K_M },
{ "Q6_K", LLAMA_FTYPE_MOSTLY_Q6_K },
{ "IQ1_S", LLAMA_FTYPE_MOSTLY_IQ1_S },
{ "IQ1_M", LLAMA_FTYPE_MOSTLY_IQ1_M },
{ "IQ2_XXS", LLAMA_FTYPE_MOSTLY_IQ2_XXS },
{ "IQ2_XS", LLAMA_FTYPE_MOSTLY_IQ2_XS },
{ "IQ2_S", LLAMA_FTYPE_MOSTLY_IQ2_S },
{ "IQ2_M", LLAMA_FTYPE_MOSTLY_IQ2_M },
{ "IQ3_XXS", LLAMA_FTYPE_MOSTLY_IQ3_XXS },
{ "IQ3_XS", LLAMA_FTYPE_MOSTLY_IQ3_XS },
{ "IQ3_S", LLAMA_FTYPE_MOSTLY_IQ3_S },
{ "IQ3_M", LLAMA_FTYPE_MOSTLY_IQ3_M },
{ "IQ4_NL", LLAMA_FTYPE_MOSTLY_IQ4_NL },
{ "IQ4_XS", LLAMA_FTYPE_MOSTLY_IQ4_XS },
{ "TQ1_0", LLAMA_FTYPE_MOSTLY_TQ1_0 },
{ "TQ2_0", LLAMA_FTYPE_MOSTLY_TQ2_0 },
{ "MXFP4_MOE", LLAMA_FTYPE_MOSTLY_MXFP4_MOE },
};

llama_ftype llama_ftype_from_name(const char * name) {
for (const auto & e : ftype_name_table) {
if (strcmp(name, e.name) == 0) {
return e.ftype;
}
Contributor

Can this logic reasonably be deduplicated with respect to static std::string llama_model_ftype_name from src/llama-model-loader.cpp?

If not, can it be made identical?

return (llama_ftype)-1;
}

const char * llama_ftype_to_name(llama_ftype ftype) {
Contributor

(same as the above?)
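One way to address the duplication raised here is to drive both directions from a single table, so the name-to-ftype and ftype-to-name mappings can never drift apart. A sketch with a truncated, illustrative table (the numeric values are placeholders, not the real LLAMA_FTYPE constants):

```cpp
#include <cassert>
#include <cstring>

// Single source of truth: both lookup directions walk this one table.
struct entry { const char * name; int ftype; };

static const entry ftype_table[] = {
    { "Q4_K_M", 15 },
    { "Q8_0",   7  },
};

int ftype_from_name(const char * name) {
    for (const auto & e : ftype_table) {
        if (std::strcmp(name, e.name) == 0) {
            return e.ftype;
        }
    }
    return -1; // unknown name
}

const char * name_from_ftype(int ftype) {
    for (const auto & e : ftype_table) {
        if (e.ftype == ftype) {
            return e.name;
        }
    }
    return nullptr; // unknown ftype
}
```

Since both functions read the same rows, adding a new quant type is a one-line change and the two directions stay identical by construction.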

// compute tensor metadata once and cache it
std::vector<tensor_metadata> metadata(tensors.size());
for (size_t i = 0; i < tensors.size(); ++i) {
metadata[i].name = ggml_get_name(tensors[i]->tensor);
Contributor

Suggested change
metadata[i].name = ggml_get_name(tensors[i]->tensor);
metadata[i].name = tensors[i]->tensor->name;

Nitpick, not sure if it matters, but calling ggml_get_name is generally unnecessary where we can just do tensor->name.
