fix(config): skip vocab arrays and mmap GGUF headers to speed up startup by Dennisadira · Pull Request #10213 · mudler/LocalAI

Dennisadira · 2026-06-07T21:23:48Z

What

On startup LocalAI parses every model's GGUF header to guess defaults (context size, GPU layers, chat template, MTP head, …). That parse was reading the entire tokenizer vocab — tokenizer.ggml.tokens/scores/merges, often 100k–260k entries — element by element, once per model. None of the guessed defaults need the vocab contents, only scalar metadata and array lengths.

This parses the header with two options instead:

- SkipLargeMetadata(): seek past large array-valued metadata rather than reading/allocating every element (lengths stay populated).
- UseMMap(): fault in a few header pages instead of issuing hundreds of thousands of tiny read() syscalls. The mapping is released when ParseGGUFFile returns.

The same scalar-only access pattern exists in the VRAM metadata reader (pkg/vram/gguf_reader.go), so it gets the same treatment.

Why

Fixes #9790. On slow storage (e.g. a models directory on a Docker Desktop volume, where every read crosses the VM/filesystem boundary) the per-element vocab reads dominate boot time — the reporter saw ~3 minutes for a single small model before the HTTP server came up.

Impact

Measured on a generated 256k-token GGUF:

| Parse mode | `read()` syscalls |
| --- | --- |
| before | ~524,000 |
| `SkipLargeMetadata` | ~262,000 |
| `+ UseMMap` (this PR) | 8 |

Verified that every field consumed downstream (architecture, head/ff counts, chat_template, BOS/EOS IDs, TokensLength/TokensSize, and EstimateLLaMACppRun which uses only lengths) is unchanged under both options.

Tests

Adds a regression spec in core/config/hooks_test.go that writes a real GGUF with a large skipped vocab and runs the actual hook path, asserting the chat template, context size, GPU layers, UseTokenizerTemplate, and FLAG_CHAT are still guessed correctly — plus an unreadable-file fallback case. go test ./core/config/ ./pkg/vram/ passes.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

When the models directory holds many GGUF files, startup parsed every model's full GGUF, including the tokenizer vocab arrays (tokenizer.ggml.tokens/scores/merges, often >100k entries), once per model while guessing defaults. On slow storage (e.g. a models directory on a Docker volume) those hundreds of thousands of tiny reads dominate boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume scalar metadata and array lengths, never the array contents. Parse with SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few header pages instead of issuing per-element read() syscalls). For a 256k-token vocab this cuts the parse from ~524k read() syscalls to 8. The mapping is released when ParseGGUFFile returns.

Fixes mudler#9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>

mudler

good catch actually. Thanks

mudler approved these changes Jun 7, 2026

mudler merged commit 2c804be into mudler:master Jun 7, 2026
1 check passed

localai-bot added the bug Something isn't working label Jun 10, 2026

BrewTestBot mentioned this pull request Jun 10, 2026

localai 4.4.0 Homebrew/homebrew-core#287347

Merged

mudler merged 1 commit into mudler:master from Dennisadira:fix/slow-startup-gguf-metadata-9790

3 participants
