llama: end-to-end tests #19802
Conversation
ggerganov left a comment
As a public API this is OK.
I think the internal implementation should be improved. I assume it's just a PoC for now, so that is OK.
My main recommendation is to:
- Encapsulate the tensor loading logic better in the `llama_model_loader` class
- Abstract the different ways of loading tensor data and avoid leaking any of that logic into `llama_model`

If you have doubts about how to do that, I can make a pass on this part of the code - LMK.
src/llama-model-loader.cpp
Outdated
```diff
-meta.reset(gguf_init_from_file(fname.c_str(), params));
-if (!meta) {
+metadata_ptr.reset(gguf_init_from_file(fname.c_str(), params));
+metadata = metadata_ptr.get();
```
The naming here is problematic - you actually want to do:

```diff
-metadata = metadata_ptr.get();
+this->metadata = metadata_ptr.get();
```

As it is, it sets the input argument instead.
```c
// Create a new model from GGUF metadata as well as a function to set the tensor data
LLAMA_API struct llama_model * llama_model_init(
          struct gguf_context * metadata,
    llama_model_set_tensor_data_t set_tensor_data,    // function to initialize tensor data with
                           void * set_tensor_data_ud, // userdata for function
       struct llama_model_params params);
```
Likely it should be called `llama_model_init_from_user`.
src/llama-model.cpp
Outdated
```cpp
auto create_tensor = [&](const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) -> ggml_tensor * {
    if (ml.files.empty()) {
        llm_tensor_info info;
        try {
            info = llm_tensor_info_for(tn.tensor);
        } catch (const std::out_of_range & e) {
            throw std::runtime_error(format("missing tensor info mapping for %s", tn.str().c_str()));
        }
        buft_list_t * buft_list;
        switch (info.layer) {
            case LLM_TENSOR_LAYER_INPUT:
                buft_list = pimpl->dev_input.buft_list;
                break;
            case LLM_TENSOR_LAYER_OUTPUT:
                buft_list = pimpl->dev_output.buft_list;
                break;
            case LLM_TENSOR_LAYER_REPEATING:
                buft_list = pimpl->dev_layer.at(tn.bid).buft_list;
                break;
            default:
                GGML_ABORT("fatal error");
        }
        ggml_backend_buffer_type_t buft = buft_list->at(0).second;
        ggml_context * ctx = ctx_for_buft(buft);
        ggml_tensor * ret = ggml_new_tensor(ctx, GGML_TYPE_F32, ne.size(), std::data(ne));
        std::string name = tn;
        ggml_set_name(ret, name.c_str());
        return ret;
    }
```
Ideally, this logic should not be needed. The data loading should be completely abstracted by llama_model_loader and the logic in llama_model should not need to handle different ways of tensor data input explicitly.
It wasn't super clear, but is the thinking that we then create a set of tensors based on hardcoded tensor parameters for that model? Is it perhaps possible to utilize the model loading stage to get these parameters OTF (or in a generation pass)?
I moved the code where changes would need to be made from
I'm not sure I understand your question. My goal is to produce tiny models that can be evaluated with minimal resources for testing specific language model architectures. I'm currently hardcoding the metadata like

Right, so what I was thinking is that this can be retrieved from the tensor loading stage of an arch.
If a model is already on disk, or with #19796, it could be done, but for my use case I think it wouldn't make sense. However, if we want to test for changes to the outputs of real models, I would write the new code in such a way that the following things can be done separately:
I pushed a version that has rudimentary support for generating models as well as testing whether results have changed. The CLI usage looks something like this:

```shell
./build/bin/test-llama-archs gen-model --model model.gguf
./build/bin/test-llama-archs gen-results --model model.gguf --results results.gguf
./build/bin/test-llama-archs test-vs-disk --model model.gguf --results results.gguf

export mn=llama_3-8b && export q=q4_0
./build/bin/test-llama-archs gen-results --model models/opt/${mn}-${q}.gguf --results results.gguf
./build/bin/test-llama-archs test-vs-disk --model models/opt/${mn}-${q}.gguf --results results.gguf
```

I'm thinking it would make sense to write a script for automating
I made the new code re-use the buffer type selection logic from the preexisting code by creating a tensor with the expected dimensions on-the-fly. This causes the embedding tensor to remain on CPU, whereas in my previous iteration I just hard-coded all tensors to go to one buffer type. I then ran into the same issue you did and took over the fix you posted. Please let me know if you make further changes to the related code; either way, for a model created in-memory I think setting things like mmap to false is correct.
I think we should keep what you have now until we merge. After that, we should consider 2 major refactors in
For now, we can accept the hacky way in order to unblock work that depends on the ability to generate small models.
To be clear, if such tensors were to be generated in the user code, they would as of right now not be used as the actual weight tensors of the model. The model loader generates a bunch of empty tensors from the GGUF file, and the code in
@ggerganov @CISC do you have an opinion on whether the current "modes" of
I have no particular opinion, but there's no code-sharing between the modes, so splitting them up would make sense.
I pushed a first version for systematically testing our supported model architectures with in-memory models. The test with CUDA vs. CPU takes ~1.5 seconds with ~50% of the models supported. I made some KV retrievals optional if the inference code supports it. I think there is a bug in the MBT code regarding how a view is calculated; I'll need to get a real model to check how the tensor shapes are actually supposed to look. For LLAMA_EMBED, CPU and CUDA yield inconsistent results.
I've pushed a version that tests >90% of our supported models and that provides a new binary
From my end I would now consider this PR ready for review.
I am able to reproduce - looking into it.
It will take me a bit of time to fix this - I think it is a problem in the Metal backend when it uses a Paravirtual Metal device (i.e. what the GitHub runner uses). For now, you can work around it using this patch:

```diff
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 30365a361..c219cc237 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -93,7 +93,7 @@ jobs:
         id: cmake_test
         run: |
           cd build
-          ctest -L main --verbose --timeout 900
+          ctest -L main -E "test-llama-arch" --verbose --timeout 900

   macOS-latest-cmake-x64:
     runs-on: macos-15-intel
```

I'll try to fix this properly later in a follow-up PR.
Regarding the expert weight scale: is there a reason why we have separate arguments for whether or not to scale the weights and also for the scale itself? It seems to me like we could just do something like

I'm guessing it's legacy carry-over; I don't see why it couldn't just be a check like that in
Anyway, thanks for zapping me regarding the scales. For now I've pushed a version that uses the scales conditionally. While checking the usage, though, I noticed that

Yes, the
I'll make a PR after this is merged. |
Without knowing anything about the implementation details of Vulkan, I think most likely this is the result of numerical instability.
The failure reproduces reliably for me with coopmat2 disabled, e.g. using
I'll look into it, yes.
It passes all models reliably for me on an AMD 8060S with or without coopmat, and on Intel Meteor Lake. It also passes on llvmpipe without coopmat.
@0cc4m are you saying you can't reproduce it at all? Not even on an RTX 3090?
Yes, I've now tested on the RTX 3090 as well and couldn't reproduce it without coopmat, or with coopmat1 or 2. I haven't seen a single failure yet, except llvmpipe with coopmat.
OK, I'll debug it a bit.
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
Edit: sorry, I misclicked and accidentally opened the PR without a description.
This PR aims to implement end-to-end tests for model inference, comparable to `test-backend-ops` except by creating toy models with random data. This would allow us to assert that the llama.cpp code runs without crashing and that the results between different backends are consistent. For me the main use case will be quality control in the context of #19378, to more easily test whether the meta backend works correctly for the model archs we support. We can also check whether a roundtrip with `llama_model_save_to_file` and `llama_model_load_from_file` works correctly. Long-term, if the training code works reliably for all model archs, we can also train small toy models to overfit on short texts and assert that they can correctly predict them afterwards.

The test code works as follows: the llama API is extended with a new function `llama_model_init` that accepts a GGUF context with model metadata as well as a user-defined function to set the data of an individual tensor. Internally this then hijacks `llama_model_loader` to initialize the model weights with the passed function rather than by loading the weights from disk. The new code in `test-llama-archs.cpp` is currently just a PoC - I intend to iterate over the values in `llama-arch.h` for the final tests. Long-term I think it may make sense to expose enums such as `llm_arch` to facilitate easier model creation via the llama API.

@ggerganov @CISC if this PoC does not meet your requirements, please tell me now while I haven't yet invested much time.