Summary
The llama-cpp gRPC backend supports two prompt-cache-related options (cache_ram, kv_unified) that are parsed in backend/cpp/llama-cpp/grpc-server.cpp from ModelOptions.options[] (proto field 62), but they are not documented anywhere in the user-facing config docs. They are a major hidden latency win for use cases that re-send the same system prompt across multiple chat-completion calls (agents, Claude-Code-style CLIs, coding assistants).
Tested on LocalAI v4.2.6 on NVIDIA DGX Spark (GB10/ARM64, CUDA 13), llama-cpp backend cuda13-nvidia-l4t-arm64-llama-cpp.
Background — what the C++ code already does
backend/cpp/llama-cpp/grpc-server.cpp (already in master):
    // Line 489–490
    params.cache_ram_mib = -1; // default no limit

    // Line 520–521
    params.kv_unified = false; // default

    // Line 531 — comment says options come as "optname:optval"

    // Line 552–555
    } else if (!strcmp(optname, "cache_ram")) {
        if (optval_str) {
            params.cache_ram_mib = std::stoi(optval_str);
        }
    }

    // Line 677–681
    } else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
        params.kv_unified = (optval_str && strcmp(optval_str, "true") == 0);
    }
Running strings on the binary confirms that it ships with the upstream llama.cpp prompt-cache code, but the prompt-cache messages never show up in real-world LocalAI deployments because the options are off by default and no documentation tells users how to turn them on.
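The parsing behaviour quoted above can be sketched as a small standalone function. This is a minimal illustration, not the code in grpc-server.cpp: the CacheParams struct and the parse_cache_options function are hypothetical names introduced here, and splitting at the first ':' is an assumption based on the "optname:optval" comment. Unlike the excerpt, the sketch catches the exception std::stoi throws on a non-numeric value and keeps the default.

```cpp
#include <exception>
#include <string>
#include <vector>

// Hypothetical stand-in for the two fields discussed in this issue;
// the real params struct has many more members.
struct CacheParams {
    int  cache_ram_mib = -1;    // default from the excerpt: no limit
    bool kv_unified    = false; // default from the excerpt
};

// Parse "optname:optval" strings along the lines of the excerpt.
// Option names other than the ones handled below are ignored.
CacheParams parse_cache_options(const std::vector<std::string>& options) {
    CacheParams params;
    for (const std::string& opt : options) {
        // Assumption: name and value are separated by the first ':'.
        const std::size_t sep = opt.find(':');
        const bool has_val = sep != std::string::npos;
        const std::string optname = opt.substr(0, sep);
        const std::string optval = has_val ? opt.substr(sep + 1) : std::string();

        if (optname == "cache_ram") {
            if (has_val) {
                try {
                    params.cache_ram_mib = std::stoi(optval);
                } catch (const std::exception&) {
                    // Non-numeric value: keep the default (a choice of this sketch).
                }
            }
        } else if (optname == "kv_unified" || optname == "unified_kv") {
            // Only the literal string "true" enables the option.
            params.kv_unified = has_val && optval == "true";
        }
    }
    return params;
}
```

Called with the two strings "cache_ram:4096" and "kv_unified:true", the function returns cache_ram_mib set to 4096 and kv_unified set to true; with an empty list it returns the defaults shown in the excerpt.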
Repro / verify
- Run LocalAI v4.2.6 with a llama-cpp model (any GGUF that exercises a large system prompt).
- Send three identical chat-completion calls with the same ~5k-token system prompt and observe that each call re-evaluates the system prompt from scratch (38-40s on GB10).
- Now add to the model YAML:
  options:
    - "cache_ram:4096"
    - "kv_unified:true"
- Restart, repeat. Calls 2-3 should hit the warm prompt cache and finish in seconds.
(I'm verifying the second half right now on my Spark; will edit with numbers.)
Asks
- Docs: the options: YAML array and the available optname:optval pairs are essentially undocumented. A docs page listing all llama-cpp-backend options would be a huge usability win.
- First-class YAML fields (nice-to-have): promote cache_ram and kv_unified to top-level fields (like context_size, gpu_layers, f16, mmap), since they have a comparable latency impact for any agent/CLI workload. Backward-compatible — keep options: as the escape hatch.
Use-case context
I'm using LocalAI as an Anthropic-compatible backend for the claude CLI via the /v1/messages endpoint. That endpoint is excellent (tool_use returns proper Anthropic schema, streaming events match — kudos). But every Claude-Code session ships a ~20-25k-token system prompt with the tool schema, and without prompt caching the latency makes the setup non-interactive (5-8 min per turn on GB10). With cache_ram + kv_unified enabled, this collapses to seconds for the cached prefix portion — exactly the use-case the upstream llama.cpp prompt-cache was designed for.
Happy to help with the docs PR if there's interest.