llama: end-to-end tests #19802

Merged
JohannesGaessler merged 9 commits into ggml-org:master from JohannesGaessler:llama-e2e-tests
Mar 8, 2026

Conversation

@JohannesGaessler
Contributor

@JohannesGaessler JohannesGaessler commented Feb 22, 2026

Edit: sorry, I misclicked and accidentally opened the PR without a description.

This PR aims to implement end-to-end tests for model inference comparable to test-backend-ops, except using toy models created with random data. This would allow us to assert that the llama.cpp code runs without crashing and that the results between different backends are consistent. For me the main use case will be quality control in the context of #19378, to more easily test whether the meta backend works correctly for the model archs we support. We can also check whether a roundtrip with llama_model_save_to_file and llama_model_load_from_file works correctly. Long-term, if the training code works reliably for all model archs, we can also train small toy models to overfit on short texts and assert that they can correctly predict those texts afterwards.

The test code works as follows: the llama API is extended with a new function llama_model_init that accepts a GGUF context with model metadata as well as a user-defined function to set the data of an individual tensor. Internally this hijacks llama_model_loader to initialize the model weights with the passed function rather than by loading the weights from disk. The new code in test-llama-archs.cpp is currently just a PoC - I intend to iterate over the values in llama-arch.h for the final tests. Long-term I think it may make sense to expose enums such as llm_arch to facilitate easier model creation via the llama API.

@ggerganov @CISC if this PoC does not meet your requirements, please tell me now, while I haven't yet invested much time.

Member

@ggerganov ggerganov left a comment

As a public API this is OK.

I think the internal implementation should be improved. I assume it's just a PoC for now, so that is OK.

My main recommendation is to:

  • Encapsulate better the tensor loading logic in the llama_model_loader class
  • Abstract different ways of loading tensor data and avoid leaking any of that logic into llama_model

If you have doubts about how to do that, I can make a pass on this part of the code - LMK.

meta.reset(gguf_init_from_file(fname.c_str(), params));
if (!meta) {
metadata_ptr.reset(gguf_init_from_file(fname.c_str(), params));
metadata = metadata_ptr.get();
Member


The naming here is problematic - you actually want to do:

Suggested change
metadata = metadata_ptr.get();
this->metadata = metadata_ptr.get();

As it is, it sets the input argument instead.

Comment on lines +445 to +452

// Create a new model from GGUF metadata as well as a function to set the tensor data
LLAMA_API struct llama_model * llama_model_init(
struct gguf_context * metadata,
llama_model_set_tensor_data_t set_tensor_data, // function to initialize tensor data with
void * set_tensor_data_ud, // userdata for function
struct llama_model_params params);

Member


Likely should be called llama_model_init_from_user

Comment on lines +2827 to +2855
auto create_tensor = [&](const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) -> ggml_tensor * {
if (ml.files.empty()) {
llm_tensor_info info;
try {
info = llm_tensor_info_for(tn.tensor);
} catch (const std::out_of_range & e) {
throw std::runtime_error(format("missing tensor info mapping for %s", tn.str().c_str()));
}
buft_list_t * buft_list;
switch (info.layer) {
case LLM_TENSOR_LAYER_INPUT:
buft_list = pimpl->dev_input.buft_list;
break;
case LLM_TENSOR_LAYER_OUTPUT:
buft_list = pimpl->dev_output.buft_list;
break;
case LLM_TENSOR_LAYER_REPEATING:
buft_list = pimpl->dev_layer.at(tn.bid).buft_list;
break;
default:
GGML_ABORT("fatal error");
}
ggml_backend_buffer_type_t buft = buft_list->at(0).second;
ggml_context * ctx = ctx_for_buft(buft);
ggml_tensor * ret = ggml_new_tensor(ctx, GGML_TYPE_F32, ne.size(), std::data(ne));
std::string name = tn;
ggml_set_name(ret, name.c_str());
return ret;
}
Member


Ideally, this logic should not be needed. The data loading should be completely abstracted by llama_model_loader and the logic in llama_model should not need to handle different ways of tensor data input explicitly.

@CISC
Member

CISC commented Feb 23, 2026

It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model?

Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?

@JohannesGaessler
Contributor Author

I moved the code where changes would need to be made from llama-model.cpp into llama-model-loader.cpp. Thinking about how to test for correctness, what I think we should do is test for changes in the outputs instead. We can extend test-llama-archs to optionally write the generated model + the generated outputs to disk. In another pass we can then load the model from disk and compare the outputs against the outputs produced with a previous software version. With the minimal parameters I pushed, the size of an individual model at FP32 precision seems to be ~30 kB.

It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model? Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?

I'm not sure I understand your question. My goal is to produce tiny models that can be evaluated with minimal resources for testing specific language model architectures. I'm currently hardcoding the metadata like n_embd that would normally be found in a GGUF file in test-llama-archs.cpp but it could easily be made configurable. It needs to be fully specified before creating the llama_model though. I've been thinking it would make sense to look at test-backend-ops.cpp and to factor out functionality that can be re-used into something like test-common.h.

@CISC
Member

CISC commented Feb 23, 2026

It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model? Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?

I'm not sure I understand your question. My goal is to produce tiny models that can be evaluated with minimal resources for testing specific language model architectures. I'm currently hardcoding the metadata like n_embd that would normally be found in a GGUF file in test-llama-archs.cpp but it could easily be made configurable. It needs to be fully specified before creating the llama_model though. I've been thinking it would make sense to look at test-backend-ops.cpp and to factor out functionality that can be re-used into something like test-common.h.

Right, so what I was thinking is that this can be retrieved from the tensor loading stage of an arch.

@JohannesGaessler
Contributor Author

JohannesGaessler commented Feb 23, 2026

If a model is already on disk, or with #19796, it could be done, but for my use case I don't think it would make sense. However, if we want to test for changes to the outputs of real models I would just write the new code in such a way that the following things can be done separately:

  • Create and save a dummy model.
  • Load a model from disk, calculate outputs, write outputs to disk.
  • Load a model and outputs from disk, calculate outputs, compare calculated outputs vs. disk.

@JohannesGaessler
Contributor Author

I pushed a version that has rudimentary support for generating models as well as testing whether results have changed. The CLI usage looks something like this:

./build/bin/test-llama-archs gen-model --model model.gguf
./build/bin/test-llama-archs gen-results --model model.gguf --results results.gguf
./build/bin/test-llama-archs test-vs-disk --model model.gguf --results results.gguf

export mn=llama_3-8b && export q=q4_0
./build/bin/test-llama-archs gen-results --model models/opt/${mn}-${q}.gguf --results results.gguf
./build/bin/test-llama-archs test-vs-disk --model models/opt/${mn}-${q}.gguf --results results.gguf

I'm thinking it would make sense to write a script for automating git bisect using this test.

@JohannesGaessler
Contributor Author

I changed the new code to re-use the buffer type selection logic from the preexisting code by creating a tensor with the expected dimensions on-the-fly. This causes the embedding tensor to remain on CPU vs. my previous iteration where I just hard-coded all tensors to go to one buffer type. So then I ran into the same issue you did and I took over the fix you posted. Please let me know if you make further changes to the related code; for a model created in memory I think setting things like mmap to false is correct either way, though.

@ggerganov
Member

I think we should keep what you have now until we merge. After that, we should consider 2 major refactors in libllama:

  • Abstract the llama_model_loader implementation to avoid branching such as if (files.empty()) { } else { }. Instead, these should be 2 separate implementations of the loader - one that reads data from files and one that uses user-provided data
  • Expose API to extract the weight tensors per arch so that we can construct the GGUF metadata in the user code as you suggested

For now, we can accept the hacky way in order to unblock work that depends on the ability to generate small models.

@JohannesGaessler
Contributor Author

Expose API to extract the weight tensors per arch so that we can construct the GGUF metadata in the user code as you suggested

To be clear: if such tensors were generated in the user code, they would, as of right now, not be used as the actual weight tensors of the model. The model loader generates a bunch of empty tensors from the GGUF file, the code in llama-model.cpp then provides expected tensor shapes, compares them against what the model loader found, and creates its own tensors in its own ggml contexts. Do you mean that the user explicitly creates weight tensors that a model should then directly take over as its weight tensors?

@JohannesGaessler
Contributor Author

JohannesGaessler commented Feb 25, 2026

@ggerganov @CISC do you have an opinion on whether the current "modes" of test-llama-archs (other than the one that would iterate over all architectures) should be made into standalone binaries? I've been thinking that in order to properly automate git bisect the code should just have all of the functionality that we give regular examples/tools in order to properly set things like -ot.

@CISC
Member

CISC commented Feb 25, 2026

I have no particular opinion, but there's no code-sharing between the modes, so splitting them up would make sense.

@JohannesGaessler
Contributor Author

I pushed a first version for systematically testing our supported model architectures with in-memory models. The test with CUDA vs. CPU is taking ~1.5 seconds with ~50% of the models supported. I made some KV retrievals optional if the inference code supports it.

I think there is a bug in the MBT code regarding how a view is calculated, I'll need to get a real model to check how the tensor shapes are actually supposed to look. For LLAMA_EMBED, CPU and CUDA yield inconsistent results.

@github-actions github-actions bot added model Model specific examples labels Feb 25, 2026
@JohannesGaessler
Contributor Author

I've pushed a version that tests >90% of our supported models and that provides a new binary llama-results to --check whether results have changed. From my end I will write a simple script for automatic git bisect, go over my changes one more time, and write a proper changelog. After that I think this PR will be ready for review.

@JohannesGaessler JohannesGaessler marked this pull request as ready for review March 3, 2026 15:29
@JohannesGaessler
Contributor Author

JohannesGaessler commented Mar 3, 2026

From my end I would now consider this PR ready for review.

  • I extended the llama API with a function llama_model_init_from_user to create a model from a GGUF context.
  • I added a new binary llama-test-archs that tests whether toy models for a given LLM_ARCH produce consistent results across ggml backend devices. I re-used and extended llama_model_saver to populate the GGUF contexts.
  • I added a new binary llama-results that writes a --model's logits for a --prompt to a given --output GGUF file. Existing results can be --checked against a re-run. The script under scripts/git-bisect.sh can be used to determine when a model's outputs have changed.
  • The inputs for llama models are allocated via the ggml_cgraph allocation. However, if any of the llama graph inputs are not used (e.g. because a model has a hard-coded period of layer types) llama_decode will crash when trying to set the input. This is very tedious to debug because it's difficult to determine from the error message why it is happening. I suspect that if we keep it like this it will result in a lot of wasted contributor time going forward. I'm not sure what the best way to fix this would be; for now I've added a comment to make debugging easier.
  • While implementing the tests I modified some of the model loading code. Mostly I just made some hardcoded arguments configurable (like the number and size of layers) or I made them optional for all architectures (when they previously were for only a subset of architectures) if there is a meaningful fallback. For Kimi Linear I changed the logic for how to determine whether a layer is MLA or KDA (consistently use hparams::is_recurrent). With the code on master it is possible to construct an inconsistent Kimi Linear GGUF file that segfaults. There should be no change for the actual model.
  • llama_model_saver is currently broken in the context of llama_model_save_to_file. For this reason the corresponding functionality in test-llama-archs to save the generated models to disk is disabled. I intend to fix this going forward.
  • As Georgi said, going forward we should replace the conditional statements using files.empty() with template methods and expose things like our architectures and KV to user code.

@github-actions github-actions bot added the script Script related label Mar 3, 2026
@ggerganov ggerganov self-requested a review March 4, 2026 15:13
@JohannesGaessler
Contributor Author

The macOS-latest-cmake-arm64 CI failure seems to be due to a segmentation fault. I cannot reproduce this segfault locally and I don't know what would be causing it. @ggerganov can you reproduce the issue or at least tell me how to disable these tests in the CI for now?

@ggerganov
Member

I am able to reproduce - looking into it.

@ggerganov
Member

It will take me a bit of time to fix this - I think it is a problem in the Metal backend when it uses a Paravirtual Metal device (i.e. what the GitHub runner uses).

For now, you can work around it using this patch:

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 30365a361..c219cc237 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -93,7 +93,7 @@ jobs:
         id: cmake_test
         run: |
           cd build
-          ctest -L main --verbose --timeout 900
+          ctest -L main -E "test-llama-arch" --verbose --timeout 900
 
   macOS-latest-cmake-x64:
     runs-on: macos-15-intel

I'll try to fix this properly later in a follow-up PR.

@github-actions github-actions bot added the devops improvements to build systems and github actions label Mar 6, 2026
@JohannesGaessler
Contributor Author

Regarding the expert weight scale: is there a reason why we have separate arguments for whether or not to scale the weights and also for the scale itself? It seems to me like we could just do something like if (w_scale != 0.0f && scale_w != 1.0f) {...}.

@CISC
Member

CISC commented Mar 7, 2026

Regarding the expert weight scale: is there a reason why we have separate arguments for whether or not to scale the weights and also for the scale itself? It seems to me like we could just do something like if (w_scale != 0.0f && scale_w != 1.0f) {...}.

I'm guessing it's legacy carry-over, don't see why it couldn't just be a check like that in build_moe_ffn. @ggerganov

@JohannesGaessler
Contributor Author

Anyway, thanks for zapping me regarding the scales. For now I've pushed a version that uses the scales conditionally. While checking the usage I noticed that NEMOTRON_H and EXAONE_MOE already suffer from this same issue on master, which I would see as an argument in favor of not explicitly specifying bool scale_w in llm_graph_context::build_moe_ffn, since it increases the risk of inconsistencies.

@ggerganov
Member

Yes, the bool scale_w seems redundant - probably some legacy leftover. Add TODO to remove it.

@CISC
Member

CISC commented Mar 8, 2026

Yes, the bool scale_w seems redundant - probably some legacy leftover. Add TODO to remove it.

I'll make a PR after this is merged.

@JohannesGaessler JohannesGaessler merged commit a976ff0 into ggml-org:master Mar 8, 2026
1 check passed
@JohannesGaessler
Contributor Author

Without knowing anything about the implementation details of Vulkan, I think most likely this is the result of numerical instability.

@jeffbolznv
Contributor

The failure reproduces reliably for me with coopmat2 disabled, e.g. using test-llama-archs.exe -a gpt-oss. Happens with the scalar or coopmat1 path, and goes away if I make FA fall back to CPU. @0cc4m can you look into this?

@0cc4m
Contributor

0cc4m commented Mar 9, 2026

I'll look into it, yes.

@0cc4m
Contributor

0cc4m commented Mar 9, 2026

It passes all models reliably for me on AMD 8060S with or without coopmat, and on Intel Meteor Lake. It also passes on llvmpipe without coopmat.

@jeffbolznv
Contributor

@0cc4m are you saying you can't reproduce it at all? Not even on RTX 3090?

@0cc4m
Contributor

0cc4m commented Mar 9, 2026

Yes, I've now tested on the RTX 3090 as well; I couldn't reproduce it without coopmat, or with coopmat1 or 2. I haven't seen a single failure yet, except llvmpipe with coopmat.

@jeffbolznv
Contributor

OK, I'll debug it a bit.

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026

Labels

devops (improvements to build systems and github actions), examples, model (Model specific), script (Script related), testing (Everything test related)
