
fix(config): skip vocab arrays and mmap GGUF headers to speed up startup #10213

Merged
mudler merged 1 commit into mudler:master from Dennisadira:fix/slow-startup-gguf-metadata-9790 on Jun 7, 2026


@Dennisadira

Contributor

What

On startup, LocalAI parses every model's GGUF header to guess defaults (context size, GPU layers, chat template, MTP head, …). That parse was reading the entire tokenizer vocab (tokenizer.ggml.tokens/scores/merges, often 100k–260k entries) element by element, once per model. None of the guessed defaults need the vocab contents; they use only scalar metadata and array lengths.

This parses the header with two options instead:

  • SkipLargeMetadata() — seek past large array-valued metadata rather than reading/allocating every element (lengths stay populated).
  • UseMMap() — fault in a few header pages instead of issuing hundreds of thousands of tiny read() syscalls. The mapping is released when ParseGGUFFile returns.

The same scalar-only access pattern exists in the VRAM metadata reader (pkg/vram/gguf_reader.go), so it gets the same treatment.

Why

Fixes #9790. On slow storage (e.g. a models directory on a Docker Desktop volume, where every read crosses the VM/filesystem boundary) the per-element vocab reads dominate boot time — the reporter saw ~3 minutes for a single small model before the HTTP server came up.

Impact

Measured on a generated 256k-token GGUF:

| Parse mode | read() syscalls |
| --- | --- |
| before | ~524,000 |
| SkipLargeMetadata | ~262,000 |
| + UseMMap (this PR) | 8 |

Verified that every field consumed downstream (architecture, head/ff counts, chat_template, BOS/EOS IDs, TokensLength/TokensSize, and EstimateLLaMACppRun which uses only lengths) is unchanged under both options.

Tests

Adds a regression spec in core/config/hooks_test.go that writes a real GGUF with a large skipped vocab and runs the actual hook path, asserting the chat template, context size, GPU layers, UseTokenizerTemplate, and FLAG_CHAT are still guessed correctly — plus an unreadable-file fallback case. go test ./core/config/ ./pkg/vram/ passes.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.

Fixes mudler#9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>

@mudler mudler left a comment

Owner


good catch actually. Thanks

@mudler mudler merged commit 2c804be into mudler:master Jun 7, 2026
1 check passed
@localai-bot localai-bot added the bug Something isn't working label Jun 10, 2026