llama_model_loader: support multiple split/shard GGUFs #6187
Conversation
Co-authored-by: slaren <slarengh@gmail.com>
- use only one gguf_context for metadata only
- store all ggml_context in a vector as the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
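For readers following along, here is a minimal sketch of the loader state this commit describes: one metadata-only `gguf_context`, one `ggml_context` per split file, and a flat table of weights that remembers which split each tensor comes from. The names and types are illustrative only, not the exact llama.cpp definitions.

```cpp
// Sketch only: approximates the refactor described in the commit above,
// not the actual llama.cpp code.
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct gguf_context;  // GGUF metadata (opaque here)
struct ggml_context;  // tensor metadata for one split file (opaque here)
struct ggml_tensor;   // tensor descriptor (opaque here)

// One entry per weight: which split it lives in, its byte offset, and the source tensor.
struct tensor_weight {
    std::string   name;    // tensor name (kept inline for this sketch)
    uint16_t      idx;     // index of the split file containing the tensor data
    size_t        offs;    // byte offset of the tensor data inside that file
    ggml_tensor * tensor;  // source tensor from the matching ggml_context
};

struct model_loader_sketch {
    gguf_context *              meta;      // single gguf_context, metadata only
    std::vector<ggml_context *> contexts;  // one ggml_context per split file
    std::vector<tensor_weight>  weights;   // all weights across all splits

    // Look up a weight by tensor name (a linear scan is enough for a sketch).
    const tensor_weight * get_weights(const std::string & name) const {
        for (const auto & w : weights) {
            if (w.name == name) {
                return &w;
            }
        }
        return nullptr;
    }
};
```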
ggerganov
left a comment
There is something not right when mmap is enabled and -ngl > 0 (at least with Metal that is). Using LLaMA 7B Q4_0:
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 933.75 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 1867.44 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 2801.12 / 147456.00)
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size = 933.69 MiB
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size = 933.69 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Metal buffer size = 933.69 MiB
llm_load_tensors: Metal buffer size = 933.69 MiB
llm_load_tensors: Metal buffer size = 933.69 MiB
.................................................................llama_model_load: error loading model: vector
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './x-00001-of-00005.gguf'
main: error: unable to load model
…nsor optional Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
So it does not manage to allocate 2 out of 5 Metal buffers; I think we should stop here. Then it tried to load the mapping into a buffer which does not exist.

@ggerganov Should we accept this case when we cannot allocate the buffer?

I think there is something wrong. It should only require one CPU buffer, since there is only one tensor allocated in the CPU. The first-last range is probably wrong, and it is causing a buffer overflow.
This should fix it:
diff --git a/llama.cpp b/llama.cpp
index cd20ad7a..2b6a5e9e 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -3199,6 +3199,9 @@ struct llama_model_loader {
*addr = mapping->addr;
for (ggml_tensor * tensor = ggml_get_first_tensor(ctx); tensor; tensor = ggml_get_next_tensor(ctx, tensor)) {
const auto & w = get_weights(ggml_get_name(tensor));
+ if (w.idx != idx) {
+ continue;
+ }
*first = std::min(*first, w.offs);
*last = std::max(*last, w.offs + ggml_nbytes(tensor));
}
@@ -5145,6 +5148,9 @@ static bool llm_load_tensors(
void * addr = nullptr;
size_t first, last;
ml.get_mapping_range(&first, &last, &addr, file_no, ctx);
+ if (first >= last) {
+ continue;
+ }
ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr((char *)addr + first, last - first);
if (buf != nullptr) {
bufs.push_back(buf);
@@ -5167,6 +5173,9 @@ static bool llm_load_tensors(
void * addr = nullptr;
size_t first, last;
ml.get_mapping_range(&first, &last, &addr, file_no, ctx);
+ if (first >= last) {
+ continue;
+ }
ggml_backend_buffer_t buf = ggml_backend_metal_buffer_from_ptr((char *) addr + first, last - first, max_size);
if (buf != nullptr) {
bufs.push_back(buf);

Maybe we need to add a dummy buffer in that case.
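To make the intent of the patch explicit, here is a small self-contained sketch of the range computation it implies: for a given split index, only the tensors stored in that split contribute to the [first, last) byte range, and an empty range means there is nothing to map for that file. The helper name and types below are illustrative, not the llama.cpp API.

```cpp
// Illustrative sketch of the [first, last) logic from the diff above;
// not the actual llama.cpp implementation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct weight_info {
    uint16_t idx;     // split file the tensor data lives in
    size_t   offs;    // byte offset of the tensor data in that file
    size_t   nbytes;  // size of the tensor data in bytes
};

// Compute the byte range covered by the tensors of split `idx`.
// Returns false when no tensor belongs to that split (first >= last),
// in which case the caller should skip creating a buffer for this file.
static bool get_mapping_range_sketch(const std::vector<weight_info> & weights,
                                     uint16_t idx, size_t * first, size_t * last) {
    *first = std::numeric_limits<size_t>::max();
    *last  = 0;
    for (const auto & w : weights) {
        if (w.idx != idx) {
            continue; // tensor is stored in another split file
        }
        *first = std::min(*first, w.offs);
        *last  = std::max(*last,  w.offs + w.nbytes);
    }
    return *first < *last;
}
```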
@ggerganov Can you please pull and retry? I have applied the same logic as before: if the allocation fails, fall back to CPU only.

@phymbert The logic is still wrong: it is asking Metal to map a buffer beyond its size. Please check the diff I posted above.
It works with the patch. Will take an extra look at the PR tomorrow. Thank you all for helping out |
…s in split, throw an exception instead of asserting
…uffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
…fore allocating backend buffer
…t dest max len. Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
ggerganov
left a comment
We can merge after slaren's approval
Excellent, really proud to have contributed.
@ggerganov Could we add this line to the Hot topics?
- support loading sharded model split using `gguf-split` cli #6187
Yes, of course |
ngxson
left a comment
LGTM! Thanks for taking the time to implement this functionality.
… allocated Co-authored-by: slaren <slarengh@gmail.com>
// check if dest ends with postfix
int size_prefix = str_split_path.size() - str_postfix.size();
if (size_prefix > 0 && str_split_path.find(str_postfix, size_prefix) != std::string::npos) {
    snprintf(dest, std::min((size_t) size_prefix, maxlen), "%s", split_path);
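For context, here is a self-contained sketch of what the snippet above is doing: test whether a split path ends with the expected postfix (e.g. "-00001-of-00005.gguf") and, if so, copy the leading prefix into a caller-provided buffer. The function name and exact signature below are hypothetical, not the PR's code.

```cpp
// Illustrative sketch of the prefix-extraction check shown above; not the PR's exact code.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>

// If `split_path` ends with `postfix` (e.g. "-00001-of-00005.gguf"),
// write the leading prefix into `dest` and return its length; otherwise return 0.
static int split_prefix_sketch(char * dest, size_t maxlen,
                               const std::string & split_path, const std::string & postfix) {
    if (split_path.size() <= postfix.size()) {
        return 0;
    }
    const size_t size_prefix = split_path.size() - postfix.size();
    // check if split_path ends with postfix
    if (split_path.compare(size_prefix, postfix.size(), postfix) != 0) {
        return 0;
    }
    // +1 so snprintf keeps size_prefix characters plus the terminating null byte
    snprintf(dest, std::min(size_prefix + 1, maxlen), "%s", split_path.c_str());
    return (int) size_prefix;
}
```

For example, split_prefix_sketch(buf, sizeof(buf), "x-00001-of-00005.gguf", "-00001-of-00005.gguf") would leave "x" in buf.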
Yeah, sorry, I was quite rushed this time; I will be more careful. Thanks!
Motivation
Since we support the `gguf-split` CLI in #6135, it is a good time to support loading a model with multiple (potentially distributed) GGUFs for the weights. For example, we can expect the Grok-1 weights to not easily fit inside a single GGUF. This change allows loading a model regardless of whether it is bundled inside a single GGUF or multiple GGUFs generated with `gguf-split`.

Changes
- New `llama_split_path` and `llama_split_prefix` to allow downstream tools to generate their own GGUF splits using the same file name convention: `"%s-%05d-of-%05d.gguf"` (see the sketch below).
- Rename `general.split` to `split.no` and `general.split_count` to `split.count`, and add `split.tensors.count`: previously created splits will not be loaded here. Use `gguf-split` from d0d5de4 to merge first, then split again with the master version.
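As an illustration of the file name convention, here is a minimal sketch of how a downstream tool could build split paths. The PR exposes `llama_split_path`/`llama_split_prefix` for this; the helper below is only a hypothetical stand-in with an assumed signature.

```cpp
// Hypothetical helper mirroring the "%s-%05d-of-%05d.gguf" convention described above;
// the exported llama_split_path may have a different signature.
#include <cstdio>
#include <string>

static std::string make_split_path(const std::string & prefix, int split_no, int split_count) {
    char buf[512];
    // split numbers are 1-based in the file names, e.g. "x-00001-of-00005.gguf"
    snprintf(buf, sizeof(buf), "%s-%05d-of-%05d.gguf", prefix.c_str(), split_no, split_count);
    return buf;
}

// Example: make_split_path("x", 1, 5) yields "x-00001-of-00005.gguf",
// which matches the first shard referenced in the logs above.
```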
Tests
cd models
../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf

You will notice the new:
llama_model_loader: additional 6 GGUFs metadata loaded.

References
CI Builds
Tasks
Special thanks to @slaren and @ngxson for supporting me in this effort.