Misc. bug: llama-server fails to load multi-shard GGUF from HF cache (selects wrong shard)

### Name and Version

```
version: 8532 (0a524f240)
built with GNU 13.3.0 for Linux x86_64
```

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell
llama-server --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:MXFP4_MOE --offline --cache-prompt -c 0
```

### Problem description & steps to reproduce

When loading a multi-shard GGUF model from the HuggingFace cache (either via `--offline` with cached files, or after the new HF cache migration from #20775), `llama-server` may attempt to load the wrong shard (e.g. shard 2 instead of shard 1), causing the error:

```
llama_model_load: error loading model: illegal split file idx: 1 (file: .../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf), model must be loaded with the first split
```

**Root cause:** `get_split_files()` in `common/download.cpp` collects all matching shards but does not sort them by shard index. Their order comes from `get_cached_files()`, which iterates the filesystem directory in arbitrary order. `model_files[0]` is then passed as the model path, even though it may not be shard 1.

**Steps to reproduce:**

1. Download a multi-shard model to the HF cache:
   ```shell
   HF_HUB_CACHE=/path/to/cache hf download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF --include "MXFP4_MOE/*"
   ```
2. Run llama-server in offline mode:
   ```shell
   HF_HUB_CACHE=/path/to/cache llama-server --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:MXFP4_MOE --offline -c 0
   ```

Whether it fails depends on filesystem directory iteration order, so it may not reproduce on all systems or all models. It is consistently reproducible with the Nemotron model above on ext4.

**Fix:** Sort the result of `get_split_files()` by shard index before returning. A single `std::sort` call is enough:

```cpp
// in common/download.cpp, get_split_files(), before "return result;"
// (std::sort requires <algorithm>, if it is not already included):
std::sort(result.begin(), result.end(), [](const hf_cache::hf_file & a, const hf_cache::hf_file & b) {
    return get_gguf_split_info(a.path).index < get_gguf_split_info(b.path).index;
});
```
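To illustrate the ordering problem outside of llama.cpp, here is a minimal standalone sketch. It assumes shard files follow the `-NNNNN-of-MMMMM.gguf` naming seen in the log above. `shard_number()` and `sort_shards()` are hypothetical helpers written for this example, not functions from `common/download.cpp`.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not llama.cpp API): extract the 1-based shard number
// from a name ending in "-NNNNN-of-MMMMM.gguf". Returns 0 if the name does
// not match, so single-file models sort first.
int shard_number(const std::string & path) {
    static const std::regex re(R"(-(\d{5})-of-(\d{5})\.gguf$)");
    std::smatch m;
    if (!std::regex_search(path, m, re)) {
        return 0;
    }
    return std::stoi(m[1].str());
}

// Reorder shard paths in place so that the first split ends up at index 0,
// regardless of the order in which the directory listing returned them.
void sort_shards(std::vector<std::string> & paths) {
    std::stable_sort(paths.begin(), paths.end(),
        [](const std::string & a, const std::string & b) {
            return shard_number(a) < shard_number(b);
        });
}
```

Calling `sort_shards()` on a list that starts with the `00002-of-00003` file moves the `00001-of-00003` file to the front, which is the property the loader needs from `model_files[0]`.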

Additionally, `migrate_old_cache_to_hf_cache()` in `common/hf-cache.cpp` only migrates the files referenced in the manifest JSON (`ggufFile` and `mmprojFile`). For multi-shard models that is only shard 1, so the remaining shards are left behind in the old cache and the migrated model cannot load. The migration should expand shard filenames (e.g. `00001-of-00003` -> all 3 shards).
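The shard expansion could look roughly like the sketch below. It assumes the same `-NNNNN-of-MMMMM.gguf` naming as above. `expand_shards()` is a hypothetical helper for illustration only, and how it would plug into the migration code is not shown.

```cpp
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not llama.cpp API): given the name of any one shard,
// return the names of all shards in the set. A name without the
// "-NNNNN-of-MMMMM.gguf" suffix is returned unchanged as a single entry.
std::vector<std::string> expand_shards(const std::string & name) {
    static const std::regex re(R"((.*)-(\d{5})-of-(\d{5})\.gguf)");
    std::smatch m;
    if (!std::regex_match(name, m, re)) {
        return { name }; // single-file model, nothing to expand
    }
    const std::string prefix = m[1].str();
    const int total = std::stoi(m[3].str());

    std::vector<std::string> out;
    for (int i = 1; i <= total; ++i) {
        char suffix[64];
        std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i, total);
        out.push_back(prefix + suffix);
    }
    return out;
}
```

With this approach the migration would derive the full file list from the single `ggufFile` entry instead of needing the manifest to list every shard.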

### First Bad Commit

Introduced by #20775 (common: add standard Hugging Face cache support), commit 8c7957ca3.

### Relevant log output

<details>
<summary>Logs</summary>

```console
srv    load_model: loading model '.../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf'
llama_model_load: error loading model: illegal split file idx: 1 (file: .../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '.../MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00002-of-00003.gguf'
srv    load_model: failed to load model
main: exiting due to model loading error
```
</details>
