Docs: llama-cpp options 'cache_ram' + 'kv_unified' not documented (huge latency win for re-prompted system prompts) #9921

## Summary

The llama-cpp gRPC backend supports two prompt-cache-related options (`cache_ram`, `kv_unified`) that are parsed in `backend/cpp/llama-cpp/grpc-server.cpp` from `ModelOptions.options[]` (proto field 62), but they are not documented anywhere in the user-facing config docs. This is a major hidden latency win for use cases that re-send the same system prompt across multiple chat-completion calls (agents, Claude-Code-style CLIs, coding assistants).

Tested on LocalAI **v4.2.6** on NVIDIA DGX Spark (GB10/ARM64, CUDA 13), llama-cpp backend `cuda13-nvidia-l4t-arm64-llama-cpp`.

## Background — what the C++ code already does

`backend/cpp/llama-cpp/grpc-server.cpp` (already in master):

```cpp
// Line 489–490
params.cache_ram_mib = -1;        // default no limit

// Line 520–521
params.kv_unified = false;        // default

// Line 531 — comment says options come as "optname:optval"
// Line 552–555
} else if (!strcmp(optname, "cache_ram")) {
    if (optval_str) {
        params.cache_ram_mib = std::stoi(optval_str);
    }
}

// Line 677–681
} else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
    params.kv_unified = (optval_str && strcmp(optval_str, "true") == 0);
}
```

Running `strings` on the binary confirms it ships with the upstream llama.cpp prompt-cache code, but the prompt-cache log messages never show up in real-world LocalAI deployments because the options are off by default and there's no doc telling users how to turn them on.
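To make the parsing behaviour in the excerpt easier to reason about, here is a minimal standalone sketch of the same `optname:optval` handling. It is not the LocalAI code: `CacheParams` and `parse_cache_options` are hypothetical names invented for this illustration, and the sketch uses `std::string` where the real file works on C strings with `strcmp`. Only the two branches quoted above are modelled.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the two params fields the excerpt touches;
// the real llama.cpp params struct has many more members.
struct CacheParams {
    int32_t cache_ram_mib = -1;  // -1 = no limit (default shown in the excerpt)
    bool kv_unified = false;     // default shown in the excerpt
};

// Applies "optname:optval" strings in order; a later entry overrides an earlier one.
// Unknown option names are ignored in this sketch.
CacheParams parse_cache_options(const std::vector<std::string>& options) {
    CacheParams params;
    for (const std::string& opt : options) {
        const std::size_t sep = opt.find(':');
        const bool has_value = sep != std::string::npos;
        const std::string name = opt.substr(0, sep);  // whole string if no ':'
        const std::string value = has_value ? opt.substr(sep + 1) : "";

        if (name == "cache_ram") {
            if (has_value) {
                // std::stoi throws on empty or non-numeric input.
                params.cache_ram_mib = std::stoi(value);
            }
        } else if (name == "kv_unified" || name == "unified_kv") {
            // Only the exact string "true" enables it; anything else disables it.
            params.kv_unified = has_value && value == "true";
        }
    }
    return params;
}
```

With this sketch, `parse_cache_options({"cache_ram:4096", "kv_unified:true"})` yields `cache_ram_mib == 4096` and `kv_unified == true`. One consequence of the exact-match comparison, assuming the real code behaves like the excerpt, is that values such as `1` or `True` would leave `kv_unified` disabled, which is worth stating in any docs page.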

## Repro / verify

1. Run LocalAI v4.2.6 with a llama-cpp model (any GGUF that exercises a large system prompt).
2. Send three identical chat-completion calls with the same ~5k-token system prompt and observe that each call re-evaluates the system prompt from scratch (38-40 s on GB10).
3. Now add to the model YAML:
   ```yaml
   options:
     - "cache_ram:4096"
     - "kv_unified:true"
   ```
4. Restart, repeat. Calls 2-3 should hit the warm prompt cache and finish in seconds.

(I'm verifying the second half right now on my Spark; will edit with numbers.)

## Asks

1. **Docs:** the `options:` YAML array and the available `optname:optval` pairs are essentially undocumented. A docs page listing all `llama-cpp`-backend options would be a huge usability win.
2. **First-class YAML fields (nice-to-have):** promote `cache_ram` and `kv_unified` to top-level fields (like `context_size`, `gpu_layers`, `f16`, `mmap`), since they have a comparable latency impact for any agent/CLI workload. This is backward-compatible: keep `options:` as the escape hatch.

## Use-case context

I'm using LocalAI as an Anthropic-compatible backend for the `claude` CLI via the `/v1/messages` endpoint. That endpoint is excellent (tool_use returns proper Anthropic schema and streaming events match, kudos). But every Claude-Code session ships a ~20-25k-token system prompt with the tool schema, and without prompt caching the latency makes the setup non-interactive (5-8 min per turn on GB10). With `cache_ram` + `kv_unified` enabled, this should collapse to seconds for the cached prefix portion, which is exactly the use case the upstream llama.cpp prompt cache was designed for.

Happy to help with the docs PR if there's interest.
