llama: end-to-end tests #19802

Merged
JohannesGaessler merged 9 commits into ggml-org:master from JohannesGaessler:llama-e2e-tests
Mar 8, 2026

Conversation

@JohannesGaessler
Contributor

@JohannesGaessler JohannesGaessler commented Feb 22, 2026

Edit: sorry, I misclicked and accidentally opened the PR without a description.

This PR aims to implement end-to-end tests for model inference comparable to test-backend-ops, except using toy models created with random data. This would allow us to assert that the llama.cpp code runs without crashing and that the results between different backends are consistent. For me the main use case will be quality control in the context of #19378, to more easily test whether the meta backend works correctly for the model archs we support. We can also check whether a roundtrip with llama_model_save_to_file and llama_model_load_from_file works correctly. Long-term, if the training code works reliably for all model archs, we can also train small toy models to overfit on short texts and assert that they can correctly predict those texts afterwards.

The test code works as follows: the llama API is extended with a new function llama_model_init that accepts a GGUF context with model metadata as well as a user-defined function to set the data of an individual tensor. Internally this hijacks llama_model_loader to initialize the model weights with the passed function rather than by loading the weights from disk. The new code in test-llama-archs.cpp is currently just a PoC - I intend to iterate over the values in llama-arch.h for the final tests. Long-term I think it may make sense to expose enums such as llm_arch to facilitate easier model creation via the llama API.

@ggerganov @CISC if this PoC does not meet your requirements, please tell me now, while I haven't yet invested much time.

Member

@ggerganov ggerganov left a comment

As a public API this is OK.

I think the internal implementation should be improved. I assume it's just a PoC for now, so that is OK.

My main recommendation is to:

  • Encapsulate better the tensor loading logic in the llama_model_loader class
  • Abstract different ways of loading tensor data and avoid leaking any of that logic into llama_model

If you have doubts about how to do that, I can make a pass on this part of the code - LMK.

meta.reset(gguf_init_from_file(fname.c_str(), params));
if (!meta) {
metadata_ptr.reset(gguf_init_from_file(fname.c_str(), params));
metadata = metadata_ptr.get();
Member


The naming here is problematic - you actually want to do:

Suggested change
metadata = metadata_ptr.get();
this->metadata = metadata_ptr.get();

As it is, it sets the input argument instead.

Comment on lines +445 to +452

// Create a new model from GGUF metadata as well as a function to set the tensor data
LLAMA_API struct llama_model * llama_model_init(
struct gguf_context * metadata,
llama_model_set_tensor_data_t set_tensor_data, // function to initialize tensor data with
void * set_tensor_data_ud, // userdata for function
struct llama_model_params params);

Member


Likely should be called llama_model_init_from_user

Comment on lines +2827 to +2855
auto create_tensor = [&](const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) -> ggml_tensor * {
if (ml.files.empty()) {
llm_tensor_info info;
try {
info = llm_tensor_info_for(tn.tensor);
} catch (const std::out_of_range & e) {
throw std::runtime_error(format("missing tensor info mapping for %s", tn.str().c_str()));
}
buft_list_t * buft_list;
switch (info.layer) {
case LLM_TENSOR_LAYER_INPUT:
buft_list = pimpl->dev_input.buft_list;
break;
case LLM_TENSOR_LAYER_OUTPUT:
buft_list = pimpl->dev_output.buft_list;
break;
case LLM_TENSOR_LAYER_REPEATING:
buft_list = pimpl->dev_layer.at(tn.bid).buft_list;
break;
default:
GGML_ABORT("fatal error");
}
ggml_backend_buffer_type_t buft = buft_list->at(0).second;
ggml_context * ctx = ctx_for_buft(buft);
ggml_tensor * ret = ggml_new_tensor(ctx, GGML_TYPE_F32, ne.size(), std::data(ne));
std::string name = tn;
ggml_set_name(ret, name.c_str());
return ret;
}
Member


Ideally, this logic should not be needed. The data loading should be completely abstracted by llama_model_loader and the logic in llama_model should not need to handle different ways of tensor data input explicitly.

@CISC
Member

CISC commented Feb 23, 2026

It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model?

Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?

@JohannesGaessler
Contributor Author

I moved the code where changes would need to be made from llama-model.cpp into llama-model-loader.cpp. Thinking about how to test for correctness, what I think we should do is test for changes in the outputs instead. We can extend test-llama-archs to optionally write the generated model + the generated outputs to disk. In another pass we can then load the model from disk and compare the outputs against the outputs produced with a previous software version. With the minimal parameters I pushed, the size of an individual model at FP32 precision seems to be ~30 kB.

It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model? Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?

I'm not sure I understand your question. My goal is to produce tiny models that can be evaluated with minimal resources for testing specific language model architectures. I'm currently hardcoding the metadata like n_embd that would normally be found in a GGUF file in test-llama-archs.cpp but it could easily be made configurable. It needs to be fully specified before creating the llama_model though. I've been thinking it would make sense to look at test-backend-ops.cpp and to factor out functionality that can be re-used into something like test-common.h.

@CISC
Member

CISC commented Feb 23, 2026

It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model? Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?

I'm not sure I understand your question. My goal is to produce tiny models that can be evaluated with minimal resources for testing specific language model architectures. I'm currently hardcoding the metadata like n_embd that would normally be found in a GGUF file in test-llama-archs.cpp but it could easily be made configurable. It needs to be fully specified before creating the llama_model though. I've been thinking it would make sense to look at test-backend-ops.cpp and to factor out functionality that can be re-used into something like test-common.h.

Right, so what I was thinking is that this can be retrieved from the tensor loading stage of an arch.

@JohannesGaessler
Contributor Author

JohannesGaessler commented Feb 23, 2026

If a model is already on disk, or with #19796, it could be done, but for my use case I don't think it would make sense. However, if we want to test for changes to the outputs of real models I would just write the new code in such a way that the following things can be done separately:

  • Create and save a dummy model.
  • Load a model from disk, calculate outputs, write outputs to disk.
  • Load a model and outputs from disk, calculate outputs, compare calculated outputs vs. disk.

@JohannesGaessler
Contributor Author

I pushed a version that has rudimentary support for generating models as well as testing whether results have changed. The CLI usage looks something like this:

./build/bin/test-llama-archs gen-model --model model.gguf
./build/bin/test-llama-archs gen-results --model model.gguf --results results.gguf
./build/bin/test-llama-archs test-vs-disk --model model.gguf --results results.gguf

export mn=llama_3-8b && export q=q4_0
./build/bin/test-llama-archs gen-results --model models/opt/${mn}-${q}.gguf --results results.gguf
./build/bin/test-llama-archs test-vs-disk --model models/opt/${mn}-${q}.gguf --results results.gguf

I'm thinking it would make sense to write a script for automating git bisect using this test.

@JohannesGaessler
Contributor Author

I changed the new code to re-use the buffer type selection logic from the preexisting code by creating a tensor with the expected dimensions on-the-fly. This causes the embedding tensor to remain on CPU vs. my previous iteration where I just hard-coded all tensors to go to one buffer type. So then I ran into the same issue you did and I took over the fix you posted. Please let me know if you make further changes to the related code; for a model created in memory I think setting things like mmap to false is correct either way, though.

@ggerganov
Member

I think we should keep what you have now until we merge. After that, we should consider 2 major refactors in libllama:

  • Abstract the llama_model_loader implementation to avoid branching such as if (files.empty()) { } else { }. Instead, these should be 2 separate implementations of the loader - one that reads data from files and one that uses user-provided data
  • Expose API to extract the weight tensors per arch so that we can construct the GGUF metadata in the user code as you suggested

For now, we can accept the hacky way in order to unblock work that depends on the ability to generate small models.

@JohannesGaessler
Contributor Author

Expose API to extract the weight tensors per arch so that we can construct the GGUF metadata in the user code as you suggested

To be clear: if such tensors were generated in the user code, they would, as of right now, not be used as the actual weight tensors of the model. The model loader generates a bunch of empty tensors from the GGUF file, the code in llama-model.cpp then provides expected tensor shapes, compares them against what the model loader found, and creates its own tensors in its own ggml contexts. Do you mean that the user explicitly creates weight tensors that a model should then directly take over as its weight tensors?

@JohannesGaessler
Contributor Author

JohannesGaessler commented Feb 25, 2026

@ggerganov @CISC do you have an opinion on whether the current "modes" of test-llama-archs (other than the one that would iterate over all architectures) should be made into standalone binaries? I've been thinking that in order to properly automate git bisect the code should just have all of the functionality that we give regular examples/tools in order to properly set things like -ot.

@CISC
Member

CISC commented Feb 25, 2026

I have no particular opinion, but there's no code-sharing between the modes, so splitting them up would make sense.

@JohannesGaessler
Contributor Author

I pushed a first version for systematically testing our supported model architectures with in-memory models. The test with CUDA vs. CPU is taking ~1.5 seconds with ~50% of the models supported. I made some KV retrievals optional if the inference code supports it.

I think there is a bug in the MBT code regarding how a view is calculated, I'll need to get a real model to check how the tensor shapes are actually supposed to look. For LLAMA_EMBED, CPU and CUDA yield inconsistent results.

@github-actions github-actions bot added model Model specific examples labels Feb 25, 2026
@JohannesGaessler
Contributor Author

I've pushed a version that tests >90% of our supported models and that provides a new binary llama-results to --check whether results have changed. From my end I will write a simple script for automatic git bisect, go over my changes one more time, and write a proper changelog. After that I think this PR will be ready for review.

@JohannesGaessler JohannesGaessler marked this pull request as ready for review March 3, 2026 15:29
@JohannesGaessler
Contributor Author

JohannesGaessler commented Mar 3, 2026

From my end I would now consider this PR ready for review.

  • I extended the llama API with a function llama_model_init_from_user to create a model from a GGUF context.
  • I added a new binary llama-test-archs that tests whether toy models for a given LLM_ARCH produce consistent results across ggml backend devices. I re-used and extended llama_model_saver to populate the GGUF contexts.
  • I added a new binary llama-results that writes a --model's logits for a --prompt to a given --output GGUF file. Existing results can be --checked against a re-run. The script under scripts/git-bisect.sh can be used to determine when a model's outputs have changed.
  • The inputs for llama models are allocated via the ggml_cgraph allocation. However, if any of the llama graph inputs are not used (e.g. because a model has a hard-coded period of layer types) llama_decode will crash when trying to set the input. This is very tedious to debug because it's difficult to determine from the error message why it is happening. I suspect that if we keep it like this it will result in a lot of wasted contributor time going forward. I'm not sure what the best way to fix this would be; for now I've added a comment to make debugging easier.
  • While implementing the tests I modified some of the model loading code. Mostly I just made some hardcoded arguments configurable (like the number and size of layers) or I made them optional for all architectures (when they previously were for only a subset of architectures) if there is a meaningful fallback. For Kimi Linear I changed the logic for how to determine whether a layer is MLA or KDA (consistently use hparams::is_recurrent). With the code on master it is possible to construct an inconsistent Kimi Linear GGUF file that segfaults. There should be no change for the actual model.
  • llama_model_saver is currently broken in the context of llama_model_save_to_file. For this reason the corresponding functionality in test-llama-archs to save the generated models to disk is disabled. I intend to fix this going forward.
  • As Georgi said, going forward we should replace the conditional statements using files.empty() with template methods and expose things like our architectures and KV to user code.

@github-actions github-actions bot added the script Script related label Mar 3, 2026
@ggerganov ggerganov self-requested a review March 4, 2026 15:13
@JohannesGaessler
Contributor Author

The macOS-latest-cmake-arm64 CI failure seems to be due to a segmentation fault. I cannot reproduce this segfault locally and I don't know what would be causing it. @ggerganov can you reproduce the issue or at least tell me how to disable these tests in the CI for now?

@ggerganov
Member

I am able to reproduce - looking into it.

@ggerganov
Member

It will take me a bit of time to fix this - I think it is a problem in the Metal backend when it uses a Paravirtual Metal device (i.e. what the GitHub runner uses).

For now, you can work around it using this patch:

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 30365a361..c219cc237 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -93,7 +93,7 @@ jobs:
         id: cmake_test
         run: |
           cd build
-          ctest -L main --verbose --timeout 900
+          ctest -L main -E "test-llama-arch" --verbose --timeout 900
 
   macOS-latest-cmake-x64:
     runs-on: macos-15-intel

I'll try to fix this properly later in a follow-up PR.

@github-actions github-actions bot added the devops improvements to build systems and github actions label Mar 6, 2026
@JohannesGaessler
Contributor Author

Regarding the expert weight scale: is there a reason why we have separate arguments for whether or not to scale the weights and also for the scale itself? It seems to me like we could just do something like if (w_scale != 0.0f && scale_w != 1.0f) {...}.

@CISC
Member

CISC commented Mar 7, 2026

Regarding the expert weight scale: is there a reason why we have separate arguments for whether or not to scale the weights and also for the scale itself? It seems to me like we could just do something like if (w_scale != 0.0f && scale_w != 1.0f) {...}.

I'm guessing it's legacy carry-over, don't see why it couldn't just be a check like that in build_moe_ffn. @ggerganov

@JohannesGaessler
Contributor Author

Anyway, thanks for zapping me regarding the scales. For now I've pushed a version that uses the scales conditionally. While checking the usage I noticed that NEMOTRON_H and EXAONE_MOE already suffer from this same issue on master, which I would see as an argument in favor of not explicitly specifying bool scale_w in llm_graph_context::build_moe_ffn, since it increases the risk of inconsistencies.

@ggerganov
Member

Yes, the bool scale_w seems redundant - probably some legacy leftover. Add TODO to remove it.

@CISC
Member

CISC commented Mar 8, 2026

Yes, the bool scale_w seems redundant - probably some legacy leftover. Add TODO to remove it.

I'll make a PR after this is merged.

@JohannesGaessler JohannesGaessler merged commit a976ff0 into ggml-org:master Mar 8, 2026
1 check passed
@JohannesGaessler
Contributor Author

Without knowing anything about the implementation details of Vulkan, I think most likely this is the result of numerical instability.

@jeffbolznv
Contributor

The failure reproduces reliably for me with coopmat2 disabled, e.g. using test-llama-archs.exe -a gpt-oss. Happens with the scalar or coopmat1 path, and goes away if I make FA fall back to CPU. @0cc4m can you look into this?

@0cc4m
Contributor

0cc4m commented Mar 9, 2026

I'll look into it, yes.

@0cc4m
Contributor

0cc4m commented Mar 9, 2026

It passes all models reliably for me on AMD 8060S with or without coopmat, and on Intel Meteor Lake. It also passes on llvmpipe without coopmat.

@jeffbolznv
Contributor

@0cc4m are you saying you can't reproduce it at all? Not even on RTX 3090?

@0cc4m
Contributor

0cc4m commented Mar 9, 2026

Yes, I've now tested on the RTX 3090 as well; I couldn't reproduce it without coopmat, or with coopmat1 or 2. I haven't seen a single failure yet, except llvmpipe with coopmat.

@jeffbolznv
Contributor

OK, I'll debug it a bit.

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026

Labels

devops (improvements to build systems and github actions), examples, model (Model specific), script (Script related), testing (Everything test related)
