Docs: llama-cpp options 'cache_ram' + 'kv_unified' not documented (huge latency win for re-prompted system prompts) #9921

@pos-ei-don

Summary

The llama-cpp gRPC backend supports two prompt-cache-related options, cache_ram and kv_unified, which are parsed in backend/cpp/llama-cpp/grpc-server.cpp from ModelOptions.options[] (proto field 62) but are not documented anywhere in the user-facing config docs. This is a major hidden latency win for use cases that re-send the same system prompt across multiple chat-completion calls (agents, Claude-Code-style CLIs, coding assistants).

Tested on LocalAI v4.2.6 on NVIDIA DGX Spark (GB10/ARM64, CUDA 13), llama-cpp backend cuda13-nvidia-l4t-arm64-llama-cpp.

Background — what the C++ code already does

backend/cpp/llama-cpp/grpc-server.cpp (already in master):

// Line 489–490
params.cache_ram_mib = -1;        // default no limit

// Line 520–521
params.kv_unified = false;        // default

// Line 531 — comment says options come as "optname:optval"
// Line 552–555
} else if (!strcmp(optname, "cache_ram")) {
    if (optval_str) {
        params.cache_ram_mib = std::stoi(optval_str);
    }
}

// Line 677–681
} else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
    params.kv_unified = (optval_str && strcmp(optval_str, "true") == 0);
}
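
For readers who don't want to open grpc-server.cpp, here is a minimal standalone sketch of how "optname:optval" entries could map onto the two settings. It is not the actual LocalAI code: CacheParams, apply_option and parse_options are hypothetical names, and catching the std::stoi exception is my own addition (the excerpt above calls std::stoi unguarded).

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <vector>

// Hypothetical stand-in for the two fields discussed above; this is NOT the
// real llama.cpp params struct, only the two members quoted in the excerpt.
struct CacheParams {
    int  cache_ram_mib = -1;    // same default as in the excerpt
    bool kv_unified    = false; // same default as in the excerpt
};

// Apply a single "optname:optval" entry. Unknown option names are ignored.
void apply_option(const std::string& entry, CacheParams& params) {
    const std::size_t colon = entry.find(':');
    const bool has_value = colon != std::string::npos;
    const std::string name  = entry.substr(0, colon);
    const std::string value = has_value ? entry.substr(colon + 1) : "";

    if (name == "cache_ram") {
        if (has_value) {
            try {
                params.cache_ram_mib = std::stoi(value);
            } catch (const std::exception&) {
                // Sketch-only choice: keep the previous value when the
                // input is not a number instead of letting stoi throw.
            }
        }
    } else if (name == "kv_unified" || name == "unified_kv") {
        // Mirrors the excerpt: only the exact string "true" enables it.
        params.kv_unified = has_value && value == "true";
    }
}

// Fold a whole options list (as it would appear in the model YAML) into
// a CacheParams value, starting from the defaults.
CacheParams parse_options(const std::vector<std::string>& options) {
    CacheParams params;
    for (const auto& entry : options) {
        apply_option(entry, params);
    }
    return params;
}
```

With this sketch, parse_options({"cache_ram:4096", "kv_unified:true"}) yields cache_ram_mib == 4096 and kv_unified == true, while an empty list leaves both defaults untouched.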

Running strings on the binary confirms that it ships with the upstream llama.cpp prompt-cache code, but the prompt-cache messages are never seen in real-world LocalAI deployments because the options are off by default and no doc tells users how to turn them on.

Repro / verify

  1. Run LocalAI v4.2.6 with a llama-cpp model (any GGUF that exercises a large system prompt).
  2. Send 3 identical chat-completion calls with the same ~5k-token system prompt — observe each call re-evaluates the system prompt from scratch (38-40s on GB10).
  3. Now add to the model YAML:
    options:
      - "cache_ram:4096"
      - "kv_unified:true"
  4. Restart, repeat. Calls 2-3 should hit the warm prompt cache and finish in seconds.

(I'm verifying the second half right now on my Spark; will edit with numbers.)

Asks

  1. Docs: the options: YAML array and the available optname:optval pairs are essentially undocumented. A docs page listing all llama-cpp-backend options would be a huge usability win.
  2. First-class YAML fields (nice-to-have): promote cache_ram and kv_unified to top-level fields (like context_size, gpu_layers, f16, mmap), since they have a comparable latency impact for any agent/CLI workload. Backward-compatible — keep options: as the escape hatch.

Use-case context

I'm using LocalAI as an Anthropic-compatible backend for the claude CLI via the /v1/messages endpoint. That endpoint is excellent (tool_use returns proper Anthropic schema, streaming events match — kudos). But every Claude-Code session ships a ~20-25k-token system prompt with the tool schema, and without prompt caching the latency makes the setup non-interactive (5-8 min per turn on GB10). With cache_ram + kv_unified enabled, this collapses to seconds for the cached prefix portion — exactly the use-case the upstream llama.cpp prompt-cache was designed for.

Happy to help with the docs PR if there's interest.
