Misc. bug: llama-server fails to load multi-shard GGUF from HF cache (selects wrong shard) #21016

@hmblair

Name and Version

version: 8532 (0a524f240)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:MXFP4_MOE --offline --cache-prompt -c 0

Problem description & steps to reproduce

When loading a multi-shard GGUF model from the HuggingFace cache (either via --offline with cached files, or after the new HF cache migration from #20775), llama-server may try to load the wrong shard (e.g. shard 2 instead of shard 1) and fail with the error below. Note that the idx in the message appears to be zero-based: it reports idx 1 for the file named 00002-of-00003.

llama_model_load: error loading model: illegal split file idx: 1 (file: .../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf), model must be loaded with the first split

Root cause: get_split_files() in common/download.cpp collects all matching shards but does not sort them by shard index. Their order comes from get_cached_files(), which iterates the filesystem directory in arbitrary order. model_files[0] is then passed as the model path, even though it may not be shard 1.

Steps to reproduce:

  1. Download a multi-shard model to the HF cache:
    HF_HUB_CACHE=/path/to/cache hf download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF --include "MXFP4_MOE/*"
    
  2. Run llama-server in offline mode:
    HF_HUB_CACHE=/path/to/cache llama-server --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:MXFP4_MOE --offline -c 0
    

Whether it fails depends on filesystem directory iteration order, so it may not reproduce on all systems or all models. It is consistently reproducible with the Nemotron model above on ext4.

Fix: Sort the result of get_split_files() by shard index before returning. A single std::sort call is enough:

// in common/download.cpp, get_split_files(), before "return result;":
std::sort(result.begin(), result.end(), [](const hf_cache::hf_file & a, const hf_cache::hf_file & b) {
    return get_gguf_split_info(a.path).index < get_gguf_split_info(b.path).index;
});
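
For anyone who wants to check the ordering logic outside the llama.cpp tree, here is a minimal standalone sketch of the same idea. It assumes shard names end in "-NNNNN-of-MMMMM.gguf", as in the log above. shard_number() and sort_shards() are hypothetical helpers written for this illustration; they are not llama.cpp functions and do not replace get_gguf_split_info().

```cpp
#include <algorithm>
#include <climits>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not the llama.cpp API): parse the 1-based shard number
// from a name ending in "-NNNNN-of-MMMMM.gguf".
// Returns INT_MAX when the name does not match, so such entries sort last.
static int shard_number(const std::string & path) {
    static const std::regex re(R"(-(\d{5})-of-(\d{5})\.gguf$)");
    std::smatch m;
    if (std::regex_search(path, m, re)) {
        return std::stoi(m[1].str());
    }
    return INT_MAX;
}

// Sort shard paths so that the first split ends up at index 0,
// regardless of the order the directory iteration produced.
static void sort_shards(std::vector<std::string> & paths) {
    std::stable_sort(paths.begin(), paths.end(),
        [](const std::string & a, const std::string & b) {
            return shard_number(a) < shard_number(b);
        });
}
```

After sort_shards(), element 0 is the 00001 shard whenever one is present, which is what the loader requires.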

Additionally, migrate_old_cache_to_hf_cache() in common/hf-cache.cpp only migrates the files referenced in the manifest JSON (ggufFile and mmprojFile). For multi-shard models that is only shard 1, so the remaining shards are left behind in the old cache and the migrated model is broken and cannot load. The migration should expand shard filenames (e.g. 00001-of-00003 -> all 3 shards).
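
The shard expansion could look roughly like the sketch below. This is only an illustration under the same naming assumption ("-NNNNN-of-MMMMM.gguf"). expand_shards() is a hypothetical name, and how it would be wired into migrate_old_cache_to_hf_cache() is not shown.

```cpp
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: given any shard name of the form
// "<prefix>-NNNNN-of-MMMMM.gguf", return the names of all MMMMM shards.
// A name that does not match is returned unchanged as a single-element list.
static std::vector<std::string> expand_shards(const std::string & name) {
    static const std::regex re(R"((.*)-(\d{5})-of-(\d{5})\.gguf)");
    std::smatch m;
    if (!std::regex_match(name, m, re)) {
        return { name };
    }
    const std::string prefix = m[1].str();
    const int total = std::stoi(m[3].str());
    std::vector<std::string> out;
    for (int i = 1; i <= total; ++i) {
        char suffix[32];
        // Same zero-padded layout as the shard names in the log output.
        std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i, total);
        out.push_back(prefix + suffix);
    }
    return out;
}
```

Given the ggufFile entry from the manifest, this yields every shard name that the migration would need to move, not just the first.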

First Bad Commit

Introduced by #20775 (common: add standard Hugging Face cache support), commit 8c7957c.

Relevant log output

srv    load_model: loading model '.../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf'
llama_model_load: error loading model: illegal split file idx: 1 (file: .../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '.../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf'
srv    load_model: failed to load model
main: exiting due to model loading error
