Name and Version
version: 8532 (0a524f240)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:MXFP4_MOE --offline --cache-prompt -c 0
Problem description & steps to reproduce
When loading a multi-shard GGUF model from the Hugging Face cache (either via --offline with cached files, or after the new HF cache migration from #20775), llama-server may attempt to load the wrong shard (e.g. shard 2 instead of shard 1), causing the error:
llama_model_load: error loading model: illegal split file idx: 1 (file: .../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf), model must be loaded with the first split
Root cause: get_split_files() in common/download.cpp collects all matching shards but does not sort them by shard index. The order depends on get_cached_files() which iterates the filesystem directory in arbitrary order. model_files[0] is then passed as the model path, but it may not be shard 1.
Steps to reproduce:
- Download a multi-shard model to the HF cache:
HF_HUB_CACHE=/path/to/cache hf download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF --include "MXFP4_MOE/*"
- Run llama-server in offline mode:
HF_HUB_CACHE=/path/to/cache llama-server --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:MXFP4_MOE --offline -c 0
Whether it fails depends on filesystem directory iteration order, so it may not reproduce on all systems or all models. It is consistently reproducible with the Nemotron model above on ext4.
Fix: Sort the result of get_split_files() by shard index before returning. A single-statement fix:
// in common/download.cpp, get_split_files(), before "return result;"
// (std::sort requires #include <algorithm>):
std::sort(result.begin(), result.end(), [](const hf_cache::hf_file & a, const hf_cache::hf_file & b) {
    return get_gguf_split_info(a.path).index < get_gguf_split_info(b.path).index;
});
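To illustrate the idea outside of llama.cpp, here is a minimal standalone sketch of the same sort. It assumes the usual "-NNNNN-of-MMMMM.gguf" suffix seen in the file names above. The helpers parse_split_index(), sort_by_split_index() and first_split() are hypothetical names made up for this example; they are not the get_gguf_split_info() used in the actual fix.

```cpp
#include <algorithm>
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not the llama.cpp API): extract the 1-based shard
// number from a name ending in "-NNNNN-of-MMMMM.gguf". Returns 0 when the
// name does not look like a split file, so such names sort first.
static int parse_split_index(const std::string & path) {
    static const std::regex re(R"(-(\d{5})-of-(\d{5})\.gguf$)");
    std::smatch m;
    if (!std::regex_search(path, m, re)) {
        return 0;
    }
    return std::stoi(m[1].str());
}

// Sort shard paths in place by ascending shard number, so that the order
// no longer depends on how the directory happened to be iterated.
static void sort_by_split_index(std::vector<std::string> & files) {
    std::stable_sort(files.begin(), files.end(),
        [](const std::string & a, const std::string & b) {
            return parse_split_index(a) < parse_split_index(b);
        });
}

// Return the path that should be handed to the loader: the lowest shard.
static std::string first_split(std::vector<std::string> files) {
    assert(!files.empty());
    sort_by_split_index(files);
    return files.front();
}
```

With this in place, the equivalent of model_files[0] is the first split regardless of the order in which the shards were collected.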
Additionally, migrate_old_cache_to_hf_cache() in common/hf-cache.cpp only migrates the files referenced in the manifest JSON (ggufFile and mmprojFile). For multi-shard models the manifest references only shard 1, so the remaining shards are left behind in the old cache, resulting in a broken model that cannot load. The migration should expand shard filenames (e.g. 00001-of-00003 -> all 3 shards).
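The shard expansion could look roughly like the sketch below. It is a standalone illustration, not a patch against common/hf-cache.cpp: expand_split_names() is a hypothetical helper name, and it assumes that every shard of a model shares the same prefix and total count as the one named in the manifest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: given one shard name such as
// "model-00001-of-00003.gguf", return the names of all shards in order.
// A name without the split suffix is returned unchanged as a single entry.
static std::vector<std::string> expand_split_names(const std::string & name) {
    static const std::regex re(R"(^(.*)-(\d{5})-of-(\d{5})\.gguf$)");
    std::smatch m;
    if (!std::regex_match(name, m, re)) {
        return { name };
    }
    const std::string prefix = m[1].str();
    const int count = std::stoi(m[3].str());
    if (count < 1) {
        return { name }; // malformed total, leave the name untouched
    }
    std::vector<std::string> out;
    for (int i = 1; i <= count; ++i) {
        char suffix[48];
        std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i, count);
        out.push_back(prefix + suffix);
    }
    assert(out.size() == static_cast<std::size_t>(count));
    return out;
}
```

The migration would then move every name returned for ggufFile instead of only the one listed in the manifest.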
First Bad Commit
Introduced by #20775 (common: add standard Hugging Face cache support), commit 8c7957c.
Relevant log output
Logs
srv load_model: loading model '.../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf'
llama_model_load: error loading model: illegal split file idx: 1 (file: .../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '.../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf'
srv load_model: failed to load model
main: exiting due to model loading error