Add unit test coverage for llama_tensor_get_type #20112
bartowski1182 wants to merge 16 commits into ggml-org:master
Conversation
Regarding the +23,000 lines and 700 KB from the .schema files, I wouldn't be opposed to instead suggesting that maintainers run the script on mainline and then again on their changes, so the files don't have to exist in the actual repo itself; that would certainly declutter.
| #include "../src/llama-arch.h" | ||
| #include "../src/llama-model.h" | ||
| #include "../src/llama-quant.h" |
There was a problem hiding this comment.
I know you mentioned to @ddh0 avoiding these kinds of includes by duplicating the structs.
Figured I should check: is my case a special one because it's being done in tests?
If not, I'm willing to back-burner this and look into the changes to the API you wanted first.
Tests are OK to include internal files. It's problematic for tools and examples because 3rd parties that use libllama will not be able to do that (include internal files) - they only work with the public API.
How long does it currently take to generate the snapshots?

16 seconds if the models aren't cached, 3 seconds if they are.
Force-pushed from 9709920 to 7374378
Force-pushed from 7374378 to 36f36a1
bartowski1182 left a comment

Adding comments to justify changes to llama-quant.cpp
```cpp
// result of parsing --tensor-type option
// (changes to this struct must be reflected in tools/quantize/quantize.cpp)
struct tensor_type_option {
```
Moved to llama-quant.h
```diff
@@ -1,25 +1,18 @@
 #include "llama.h"
```
Moved to llama-quant.h
```cpp
#include <algorithm>
#include <cmath>
#include <cstring>
#include <string>
```
Moved to llama-quant.h
```cpp
// quantization state
//
struct quantize_state_impl {
```
Moved to llama-quant.h
```diff
 //
-static bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
+bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
```
Made public so I can use it in my tests
src/llama-quant.cpp
Outdated
```diff
-// outer wrapper: determine the ggml_type that this tensor should be quantized to
-static ggml_type llama_tensor_get_type(quantize_state_impl & qs, const llama_model_quantize_params * params, const ggml_tensor * tensor, ggml_type default_type, const tensor_metadata & tm) {
+// public API: compute category from tensor name and delegate to _impl
+ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
```
Swapped to an internal/public API so that I can avoid the need for ggml_get_name and tensor_get_category
```diff
 //
-static ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
+ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
```
Made public so it can be tested
```cpp
        case LLAMA_FTYPE_MOSTLY_IQ3_S:
        case LLAMA_FTYPE_MOSTLY_IQ3_M: return GGML_TYPE_IQ3_S;

        default: throw std::runtime_error(format("invalid output file type %d\n", ftype));
```
This was causing annoyance when trying to iterate through all FTYPEs one by one, so I removed the throw from here and added it in place below.
I could instead iterate in a way that avoids the missing middle quants (Q4_0_4_4 etc.) and revert this.
src/llama-quant.cpp
Outdated
```cpp
    return nullptr;
}

void init_quantize_state_counters(quantize_state_impl & qs, const std::vector<std::string> & tensor_names) {
```
Extracted the initialization of the state counters so I can use them outside the main function.
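A hedged sketch (not the PR's actual code) of why extracting counter initialization into a free function helps testing: the counts the quantization state tracks can be derived from tensor names alone, so a test can feed in synthetic name lists without loading a model. The `mock_counters` and `init_counters` names below are hypothetical stand-ins.

```cpp
#include <string>
#include <vector>

// Hypothetical counters, loosely modeled on the kind of per-tensor-class
// counts the quantization state keeps (names here are invented).
struct mock_counters {
    int n_attention_wv = 0;
    int n_ffn_down     = 0;
};

// Derives the counters from tensor names only, so a unit test can call
// this with a handful of synthetic strings.
mock_counters init_counters(const std::vector<std::string> & names) {
    mock_counters c;
    for (const auto & n : names) {
        if (n.find("attn_v.weight") != std::string::npos) { c.n_attention_wv++; }
        if (n.find("ffn_down")      != std::string::npos) { c.n_ffn_down++;     }
    }
    return c;
}
```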
```diff
-    default_type = llama_ftype_get_default_type(ftype);
+    ggml_type default_type = llama_ftype_get_default_type(ftype);
+    if (default_type == GGML_TYPE_COUNT) {
```
(this is where I added the throw back)
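The sentinel-return pattern being discussed can be sketched as follows: the lookup itself never throws, returning a sentinel (GGML_TYPE_COUNT in the real code) so that tests can iterate every ftype value, gaps included, and only the real call site converts the sentinel into an error. The enums and function names here are hypothetical, not the PR's code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical ftype/type enums; FTYPE_REMOVED stands in for a gap like
// the retired Q4_0_4_4 values.
enum my_ftype { FTYPE_Q4_0, FTYPE_Q8_0, FTYPE_REMOVED, FTYPE_COUNT };
enum my_gtype { GTYPE_Q4_0, GTYPE_Q8_0, GTYPE_COUNT };

// Lookup never throws: unknown/removed ftypes map to the sentinel,
// which makes exhaustive iteration in tests painless.
my_gtype default_type_for(my_ftype f) {
    switch (f) {
        case FTYPE_Q4_0: return GTYPE_Q4_0;
        case FTYPE_Q8_0: return GTYPE_Q8_0;
        default:         return GTYPE_COUNT; // sentinel, caller decides
    }
}

// The one real call site turns the sentinel back into an error.
my_gtype default_type_or_throw(my_ftype f) {
    my_gtype t = default_type_for(f);
    if (t == GTYPE_COUNT) {
        throw std::runtime_error("invalid output file type " + std::to_string((int) f));
    }
    return t;
}
```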
```cpp
struct ftype_name_entry {
    const char * name;
    llama_ftype ftype;
};

static const ftype_name_entry ftype_name_table[] = {
    { "F32",       LLAMA_FTYPE_ALL_F32         },
    { "F16",       LLAMA_FTYPE_MOSTLY_F16      },
    { "BF16",      LLAMA_FTYPE_MOSTLY_BF16     },
    { "Q4_0",      LLAMA_FTYPE_MOSTLY_Q4_0     },
    { "Q4_1",      LLAMA_FTYPE_MOSTLY_Q4_1     },
    { "Q5_0",      LLAMA_FTYPE_MOSTLY_Q5_0     },
    { "Q5_1",      LLAMA_FTYPE_MOSTLY_Q5_1     },
    { "Q8_0",      LLAMA_FTYPE_MOSTLY_Q8_0     },
    { "Q2_K",      LLAMA_FTYPE_MOSTLY_Q2_K     },
    { "Q2_K_S",    LLAMA_FTYPE_MOSTLY_Q2_K_S   },
    { "Q3_K_S",    LLAMA_FTYPE_MOSTLY_Q3_K_S   },
    { "Q3_K_M",    LLAMA_FTYPE_MOSTLY_Q3_K_M   },
    { "Q3_K_L",    LLAMA_FTYPE_MOSTLY_Q3_K_L   },
    { "Q4_K_S",    LLAMA_FTYPE_MOSTLY_Q4_K_S   },
    { "Q4_K_M",    LLAMA_FTYPE_MOSTLY_Q4_K_M   },
    { "Q5_K_S",    LLAMA_FTYPE_MOSTLY_Q5_K_S   },
    { "Q5_K_M",    LLAMA_FTYPE_MOSTLY_Q5_K_M   },
    { "Q6_K",      LLAMA_FTYPE_MOSTLY_Q6_K     },
    { "IQ1_S",     LLAMA_FTYPE_MOSTLY_IQ1_S    },
    { "IQ1_M",     LLAMA_FTYPE_MOSTLY_IQ1_M    },
    { "IQ2_XXS",   LLAMA_FTYPE_MOSTLY_IQ2_XXS  },
    { "IQ2_XS",    LLAMA_FTYPE_MOSTLY_IQ2_XS   },
    { "IQ2_S",     LLAMA_FTYPE_MOSTLY_IQ2_S    },
    { "IQ2_M",     LLAMA_FTYPE_MOSTLY_IQ2_M    },
    { "IQ3_XXS",   LLAMA_FTYPE_MOSTLY_IQ3_XXS  },
    { "IQ3_XS",    LLAMA_FTYPE_MOSTLY_IQ3_XS   },
    { "IQ3_S",     LLAMA_FTYPE_MOSTLY_IQ3_S    },
    { "IQ3_M",     LLAMA_FTYPE_MOSTLY_IQ3_M    },
    { "IQ4_NL",    LLAMA_FTYPE_MOSTLY_IQ4_NL   },
    { "IQ4_XS",    LLAMA_FTYPE_MOSTLY_IQ4_XS   },
    { "TQ1_0",     LLAMA_FTYPE_MOSTLY_TQ1_0    },
    { "TQ2_0",     LLAMA_FTYPE_MOSTLY_TQ2_0    },
    { "MXFP4_MOE", LLAMA_FTYPE_MOSTLY_MXFP4_MOE },
};

llama_ftype llama_ftype_from_name(const char * name) {
    for (const auto & e : ftype_name_table) {
        if (strcmp(name, e.name) == 0) {
            return e.ftype;
        }
```
Can this logic reasonably be deduplicated with respect to the static `std::string llama_model_ftype_name` in src/llama-model-loader.cpp?
If not, can it be made identical?
```cpp
    return (llama_ftype)-1;
}

const char * llama_ftype_to_name(llama_ftype ftype) {
```
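One way to answer the deduplication question is a single static table driving both lookup directions, so the two can never drift apart. A minimal sketch, with hypothetical identifiers that are not the PR's (or src/llama-model-loader.cpp's) real names:

```cpp
#include <cstring>

// Hypothetical ftype enum; SK_FTYPE_INVALID mirrors the (llama_ftype)-1
// sentinel used for unknown names.
enum sk_ftype { SK_FTYPE_Q4_0, SK_FTYPE_Q8_0, SK_FTYPE_INVALID = -1 };

struct sk_entry { const char * name; sk_ftype ftype; };

// One table, scanned in both directions.
static const sk_entry sk_table[] = {
    { "Q4_0", SK_FTYPE_Q4_0 },
    { "Q8_0", SK_FTYPE_Q8_0 },
};

sk_ftype sk_ftype_from_name(const char * name) {
    for (const auto & e : sk_table) {
        if (std::strcmp(name, e.name) == 0) { return e.ftype; }
    }
    return SK_FTYPE_INVALID;
}

const char * sk_ftype_to_name(sk_ftype f) {
    for (const auto & e : sk_table) {
        if (e.ftype == f) { return e.name; }
    }
    return nullptr;
}
```

Both directions are linear scans over a table of ~30 entries, which is only ever consulted at quantize-tool startup, so the simplicity seems worth more than a map.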
```cpp
// compute tensor metadata once and cache it
std::vector<tensor_metadata> metadata(tensors.size());
for (size_t i = 0; i < tensors.size(); ++i) {
    metadata[i].name = ggml_get_name(tensors[i]->tensor);
```

Suggested change:

```diff
-    metadata[i].name = ggml_get_name(tensors[i]->tensor);
+    metadata[i].name = tensors[i]->tensor->name;
```
Nitpick, not sure if it matters, but calling ggml_get_name is generally unnecessary where we can just do tensor->name.
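For context on the suggestion: to my understanding, `ggml_get_name` is just a trivial accessor over the tensor's fixed-size name buffer, so reading the field directly is equivalent for a non-null tensor. A minimal mock illustrating the equivalence, using a hypothetical stand-in struct rather than ggml's real `ggml_tensor`:

```cpp
#include <cstring>

// Hypothetical stand-in for ggml_tensor's embedded name buffer.
constexpr int MOCK_MAX_NAME = 64;

struct mock_tensor {
    char name[MOCK_MAX_NAME];
};

// Mirrors what an accessor like ggml_get_name would do: return a
// pointer to the embedded buffer, nothing more.
const char * mock_get_name(const mock_tensor * t) {
    return t->name;
}
```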
This is part of a larger goal of reworking or replacing the `llama_tensor_get_type` function.

Before major work starts in that area, I want to capture the current existing behaviour thoroughly, so that any accidental changes are easy to spot, and any purposeful changes are easy to document.

To that end, this PR introduces unit test coverage for the function itself.

Using a pre-set list of models, and taking advantage of the new `gguf-model-data` utility, these tests pull real model metadata directly from huggingface, create mock models/tensors, run them through the `llama_tensor_get_type` function, and document the schema into the `tests/snapshots/` directory as `model-name.schema`.

I hope that the storage method I've decided to use won't be overly burdensome. I iterated a lot through various ways to store the existing tensor layouts, and I think this landed in an acceptable compromise area: not too many files, "only" 700 KB of storage.

If this is not acceptable, I can go back to the drawing board.

To capture current layouts of tensors, the script can be run with the `--generate` flag, which will download the metadata and produce the schema files.

Then you can run the test itself without any arguments.

There are a few changes to llama-quant.cpp, notably making the function itself non-static so I can access it, extracting `init_quantize_state_counters` and `llama_ftype_default_type` so I can use them without recreating them, and lastly I pulled in the change from @ddh0 in their PR #19770 that adds `tensor_allows_quantization` (except I made it non-static) since I needed a function like that, so once that PR is merged (and this is rebased on latest) that change will go away.

I also moved a couple of things like `quantize_state_impl` to the header file for similar reasons.

Finally, I found an issue with the `gguf-model-data` `gguf_read_uint32_val` when the per-layer head counts are an array, as with Step-3.5-Flash, so I've added a fix for that and a unit test to capture that behaviour.
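The `gguf_read_uint32_val` fix is only described at a high level, so here is a hedged sketch of the general pattern such a fix tends to take: normalizing a metadata field that may hold either a single uint32 or a per-layer array of them (as with Step-3.5-Flash's head counts) into one shape. All names and types below are hypothetical, not the utility's real API.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical KV value that can carry either a scalar or an array,
// loosely modeled on GGUF metadata entries.
struct kv_value {
    bool is_array = false;
    uint32_t scalar = 0;
    std::vector<uint32_t> array;
};

// Normalize both shapes to one vector of length n_layers: arrays are
// validated, scalars are broadcast across all layers.
std::vector<uint32_t> read_uint32_values(const kv_value & v, size_t n_layers) {
    if (v.is_array) {
        if (v.array.size() != n_layers) {
            throw std::runtime_error("per-layer array has wrong length");
        }
        return v.array;
    }
    return std::vector<uint32_t>(n_layers, v.scalar);
}
```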