fix(config): skip vocab arrays and mmap GGUF headers to speed up startup #10213
Merged
mudler merged 1 commit on Jun 7, 2026
When the models directory holds many GGUF files, startup parsed every model's full GGUF, including the tokenizer vocab arrays (tokenizer.ggml.tokens/scores/merges, often >100k entries), once per model while guessing defaults. On slow storage (e.g. a models directory on a Docker volume) those hundreds of thousands of tiny reads dominate boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume scalar metadata and array lengths, never the array contents. Parse with SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few header pages instead of issuing per-element read() syscalls). For a 256k-token vocab this cuts the parse from ~524k read() syscalls to 8. The mapping is released when ParseGGUFFile returns.

Fixes mudler#9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
What
On startup LocalAI parses every model's GGUF header to guess defaults (context size, GPU layers, chat template, MTP head, …). That parse was reading the entire tokenizer vocab (tokenizer.ggml.tokens/scores/merges, often 100k–260k entries) element by element, once per model. None of the guessed defaults need the vocab contents, only scalar metadata and array lengths.

This parses the header with two options instead:

- SkipLargeMetadata(): seek past large array-valued metadata rather than reading/allocating every element (lengths stay populated).
- UseMMap(): fault in a few header pages instead of issuing hundreds of thousands of tiny read() syscalls. The mapping is released when ParseGGUFFile returns.

The same scalar-only access pattern exists in the VRAM metadata reader (pkg/vram/gguf_reader.go), so it gets the same treatment.

Why

Fixes #9790. On slow storage (e.g. a models directory on a Docker Desktop volume, where every read crosses the VM/filesystem boundary) the per-element vocab reads dominate boot time: the reporter saw ~3 minutes for a single small model before the HTTP server came up.
Impact
Measured on a generated 256k-token GGUF, the parse drops from ~524k read() syscalls to 8 with SkipLargeMetadata + UseMMap (this PR).

Verified that every field consumed downstream (architecture, head/ff counts, chat_template, BOS/EOS IDs, TokensLength/TokensSize, and EstimateLLaMACppRun, which uses only lengths) is unchanged under both options.

Tests

Adds a regression spec in core/config/hooks_test.go that writes a real GGUF with a large skipped vocab and runs the actual hook path, asserting the chat template, context size, GPU layers, UseTokenizerTemplate, and FLAG_CHAT are still guessed correctly, plus an unreadable-file fallback case.

go test ./core/config/ ./pkg/vram/ passes.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]