Summary
When running llama-server behind a Caddy load balancer with lb_policy round_robin, consecutive LLM calls from the same Netclaw session alternate between GPU backends. This defeats llama-server's internal KV cache: the prefix from turn 1 is stranded on gpu0 when turn 2 lands on gpu1. With LLAMA_PARALLEL=4 in production, even same-GPU requests can hit different slots.
The result: every turn pays the full prompt-processing cost as if it were a cold start, even though the server supports prefix caching internally.
Root cause
NetclawChatClientProvider creates one IChatClient per model role at startup (src/Netclaw.Daemon/Configuration/NetclawChatClientProvider.cs:17-24). All sessions share the same underlying HttpClient (src/Netclaw.Providers/ProviderPluginBase.cs:43-50). There's no per-session identity in the HTTP request for the load balancer to pin on.
Caddy config (services/llama-server/Caddyfile:13) uses lb_policy round_robin, as noted in the summary.
Proposed fix
Netclaw side: session-aware delegating handler
Add a DelegatingHandler in the provider pipeline that injects an X-Session-Id header (or X-LB-Hash) into each LLM request. The value should be the Netclaw SessionId so all turns in the same conversation route to the same backend.
This requires threading the session identity into the HTTP layer. Options:
Option A — per-session HttpClient: IChatClientProvider.GetClient() gains a session-id parameter, returns (or caches) a client with a pinned header. Simple but creates more HTTP connections.
Option B — AsyncLocal or ChatOptions.AdditionalProperties: Thread session ID via ChatOptions metadata, picked up by a delegating handler that sets the header per-request. Single shared HttpClient, no connection proliferation.
Option B is cleaner — one HttpClient connection pool, session affinity via header only.
Caddy side: hash on session header
lb_policy header X-Session-Id
Caddy's header policy hashes the header value to select a backend. Same session ID → same GPU, deterministically. Different sessions distribute across GPUs naturally.
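To illustrate the difference between the two policies, here is a minimal simulation sketch. It is not Caddy's implementation: the SHA-256-plus-modulo hash is a stand-in for whatever hashing Caddy uses internally, and the backend names, session ID, and helper functions are hypothetical. It only shows that consecutive turns of one session never revisit the previous backend under round-robin, but always do under a hash of the session header.

```python
import hashlib
from itertools import count

BACKENDS = ["gpu0", "gpu1"]  # hypothetical backend names


def round_robin_router(backends):
    """Route each request to the next backend in turn, ignoring the session."""
    counter = count()

    def route(session_id):
        return backends[next(counter) % len(backends)]

    return route


def header_hash_router(backends):
    """Route by hashing the session header value (stand-in hash, not Caddy's)."""

    def route(session_id):
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        return backends[int.from_bytes(digest[:8], "big") % len(backends)]

    return route


def cache_hits(route, requests):
    """Count requests that land on the backend that served the session's previous turn."""
    last_backend = {}
    hits = 0
    for session_id in requests:
        backend = route(session_id)
        if last_backend.get(session_id) == backend:
            hits += 1
        last_backend[session_id] = backend
    return hits


# Four consecutive turns from one session, as in a tool loop.
turns = ["session-a"] * 4
print(cache_hits(round_robin_router(BACKENDS), turns))  # 0: every turn switches GPU
print(cache_hits(header_hash_router(BACKENDS), turns))  # 3: turns 2-4 reuse the GPU
```

Note that a "hit" here only means the request reached the same GPU as the previous turn; whether llama-server actually reuses the cached prefix still depends on slot assignment.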
For the LLAMA_PARALLEL slot problem: llama-server doesn't support slot pinning via HTTP header today. The GPU affinity alone is a significant win — same GPU means the KV cache is at least present in GPU memory. Slot reuse within the same GPU happens naturally under low-to-moderate load.
Expected impact
- Multi-turn sessions: turn 2+ should see measurably faster TTFT because the KV cache prefix from turn 1 is on the same GPU
- Tool loops: tool-call → result → follow-up LLM call sequences (3+ calls in rapid succession) should benefit significantly since they happen within seconds
- Compaction observer: the sidecar LLM call during compaction would ideally route to the same GPU as the main session (shared prefix)
Measurement
Before/after comparison using wall-clock TTFT across multi-turn sessions. The "dark matter" approach: we can't observe the KV cache directly, but same-GPU routing should produce measurably lower latency on turn 2+ compared to round-robin.
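One way the before/after comparison could be summarized, as a sketch only: assume TTFT has already been recorded per turn (in milliseconds) for each session under both policies. The function names and the input shape (a list of sessions, each a list of per-turn TTFT values) are assumptions made for illustration, not an existing Netclaw API. Turn 1 is excluded because it is a cold start under either policy.

```python
from statistics import median


def warm_turn_ttfts(sessions):
    """Flatten TTFT samples for turn 2 onward; turn 1 is always a cold start."""
    return [ttft for turns in sessions for ttft in turns[1:]]


def compare_runs(round_robin_sessions, affinity_sessions):
    """Compare median warm-turn TTFT between a round-robin run and an affinity run."""
    rr_ms = median(warm_turn_ttfts(round_robin_sessions))
    affinity_ms = median(warm_turn_ttfts(affinity_sessions))
    return {
        "round_robin_ms": rr_ms,
        "affinity_ms": affinity_ms,
        "speedup": rr_ms / affinity_ms,
    }
```

Using the median rather than the mean keeps a single slow turn (for example, one that lost its slot under load) from dominating the comparison.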
Files likely touched
src/Netclaw.Providers/ProviderPluginBase.cs — delegating handler registration
src/Netclaw.Providers/SelfHosted/OpenAiCompatibleProviderPlugin.cs — thread session context
src/Netclaw.Configuration/IChatClientProvider.cs — may need session-aware API
services/llama-server/Caddyfile (testlab-setup repo) — lb_policy header X-Session-Id
Out of scope
- LLAMA_PARALLEL slot pinning (requires llama-server changes, not ours to make)