Session-sticky LLM routing: add X-Session-Id header for load balancer affinity

## Summary

When llama-server runs behind a Caddy load balancer with `lb_policy round_robin`, consecutive LLM calls from the same Netclaw session alternate between GPU backends. This defeats llama-server's internal KV cache: the prefix from turn 1 is stranded on gpu0 when turn 2 lands on gpu1. With `LLAMA_PARALLEL=4` in production, even same-GPU requests can hit different slots.

The result: every turn pays the full prompt-processing cost as if it were a cold start, even though the server supports prefix caching internally.

## Root cause

`NetclawChatClientProvider` creates one `IChatClient` per model role at startup (`src/Netclaw.Daemon/Configuration/NetclawChatClientProvider.cs:17-24`). All sessions share the same underlying `HttpClient` (`src/Netclaw.Providers/ProviderPluginBase.cs:43-50`). There's no per-session identity in the HTTP request for the load balancer to pin on.

Caddy config (`services/llama-server/Caddyfile:13`):
```
lb_policy round_robin
```

## Proposed fix

### Netclaw side: session-aware delegating handler

Add a `DelegatingHandler` in the provider pipeline that injects an `X-Session-Id` header (or `X-LB-Hash`) into each LLM request. The value should be the Netclaw `SessionId` so all turns in the same conversation route to the same backend.

This requires threading the session identity into the HTTP layer. Options:

**Option A: per-session `HttpClient`.** `IChatClientProvider.GetClient()` gains a session-id parameter and returns (or caches) a client with a pinned header. Simple, but it creates more HTTP connections.

**Option B: `AsyncLocal` or `ChatOptions.AdditionalProperties`.** Thread the session ID through `ChatOptions` metadata or an `AsyncLocal`, and have a delegating handler set the header per request. A `DelegatingHandler` sees only the `HttpRequestMessage`, so a value carried in `ChatOptions` still has to be bridged into the HTTP layer before the handler can read it. Single shared `HttpClient`, no connection proliferation.

Option B is cleaner: one `HttpClient` connection pool, with session affinity carried by the header alone.
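The data flow of Option B can be sketched in a language-neutral way. The Python below is only an illustration, not Netclaw code: `contextvars.ContextVar` stands in for .NET's `AsyncLocal<T>`, `SessionHeaderHandler` for the proposed `DelegatingHandler`, and `fake_transport`, `run_turn`, and the session names are hypothetical. It shows two concurrent sessions sharing one client while each request still carries its own `X-Session-Id`.

```python
import asyncio
import contextvars

# Ambient per-conversation identity; plays the role AsyncLocal<string> would in .NET.
current_session_id = contextvars.ContextVar("current_session_id", default=None)


class SessionHeaderHandler:
    """Stand-in for a DelegatingHandler: stamps the header, then forwards."""

    def __init__(self, inner):
        self._inner = inner

    async def send(self, request):
        session_id = current_session_id.get()
        if session_id is not None:
            request["headers"]["X-Session-Id"] = session_id
        return await self._inner(request)


async def fake_transport(request):
    # Hypothetical transport: yields to the event loop, then echoes the headers
    # the load balancer would have seen.
    await asyncio.sleep(0)
    return dict(request["headers"])


async def run_turn(client, session_id):
    # Set the ambient ID for this task only; other tasks keep their own value.
    current_session_id.set(session_id)
    return await client.send({"headers": {}})


async def main():
    client = SessionHeaderHandler(fake_transport)  # one shared client for all sessions
    return await asyncio.gather(
        run_turn(client, "session-a"),
        run_turn(client, "session-b"),
    )


results = asyncio.run(main())
```

Each task started by `asyncio.gather` gets its own copy of the context, which is the same isolation property the real handler would rely on from `AsyncLocal`.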

### Caddy side: hash on session header

```
lb_policy header X-Session-Id
```

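Caddy's `header` policy hashes the header value to select a backend, so the same session ID maps to the same GPU for as long as the set of available upstreams is unchanged, and different sessions spread across GPUs. Requests that lack the header are not pinned, so every LLM call path needs to go through the handler.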
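To make the difference between the two policies concrete, here is a small simulation. It is a sketch under stated assumptions: the backend names are hypothetical, and the SHA-256-modulo mapping is a stand-in chosen for illustration, not Caddy's actual hashing algorithm. Only the property that matters here is modelled: a hash of the header value is stable per session, while round-robin is not.

```python
import hashlib
from itertools import cycle

BACKENDS = ["gpu0", "gpu1"]  # hypothetical upstream names


def pick_by_header(session_id: str) -> str:
    """Map a session ID to a backend via a stable hash (stand-in for Caddy's policy)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


def route_round_robin(requests):
    """Assign backends in strict rotation, ignoring which session each request belongs to."""
    rotation = cycle(BACKENDS)
    return [(sid, next(rotation)) for sid in requests]


def route_by_header(requests):
    """Assign each request the backend its session ID hashes to."""
    return [(sid, pick_by_header(sid)) for sid in requests]


def backends_per_session(routed):
    """Collect the set of backends each session touched."""
    seen = {}
    for sid, backend in routed:
        seen.setdefault(sid, set()).add(backend)
    return seen


# Three turns from each of two sessions, arriving in per-session bursts.
requests = ["session-a"] * 3 + ["session-b"] * 3

rr = backends_per_session(route_round_robin(requests))
sticky = backends_per_session(route_by_header(requests))
```

With round-robin, each session's three turns are spread over both backends; with the header hash, each session touches exactly one. Which backend a given ID lands on depends on the hash and is not something the sketch predicts for Caddy.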

For the `LLAMA_PARALLEL` slot problem: llama-server doesn't support slot pinning via HTTP header today. The GPU affinity alone is a significant win — same GPU means the KV cache is at least present in GPU memory. Slot reuse within the same GPU happens naturally under low-to-moderate load.

## Expected impact

- **Multi-turn sessions**: turn 2+ should see measurably faster TTFT because the KV cache prefix from turn 1 is on the same GPU
- **Tool loops**: tool-call → result → follow-up LLM call sequences (3+ calls in rapid succession) should benefit significantly since they happen within seconds
- **Compaction observer**: the sidecar LLM call during compaction would ideally route to the same GPU as the main session (shared prefix)

## Measurement

Before/after comparison using wall-clock TTFT across multi-turn sessions. The "dark matter" approach: we can't observe the KV cache directly, but same-GPU routing should produce measurably lower latency on turn 2+ compared to round-robin.
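A possible shape for the comparison tooling is sketched below. The helper names are hypothetical, and the code makes two assumptions: that TTFT is taken as the wall-clock time until the first chunk of a streamed response, and that the median over turns 2+ is a reasonable summary statistic. It contains no measurements.

```python
import time
from statistics import median


def time_to_first_token(chunks):
    """Seconds until the first item arrives from a streaming response iterator."""
    start = time.perf_counter()
    for _ in chunks:
        return time.perf_counter() - start
    return None  # the stream ended without producing anything


def warm_turn_median(samples):
    """Median TTFT over turns 2+.

    samples: iterable of (turn_number, ttft_seconds), with turn numbers starting at 1.
    Turn 1 is excluded because it is a cold start under either routing policy.
    """
    warm = [ttft for turn, ttft in samples if turn >= 2]
    return median(warm) if warm else None


def speedup(before, after):
    """Ratio of warm-turn medians (round-robin / header-hash); above 1 means faster after."""
    b, a = warm_turn_median(before), warm_turn_median(after)
    if b is None or a is None or a == 0:
        return None
    return b / a
```

Collecting one sample list under `lb_policy round_robin` and one under `lb_policy header X-Session-Id`, with the same prompts and turn counts, then calling `speedup(before, after)` would give a single number to report.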

## Files likely touched

- `src/Netclaw.Providers/ProviderPluginBase.cs` — delegating handler registration
- `src/Netclaw.Providers/SelfHosted/OpenAiCompatibleProviderPlugin.cs` — thread session context
- `src/Netclaw.Configuration/IChatClientProvider.cs` — may need session-aware API
- `services/llama-server/Caddyfile` (testlab-setup repo) — `lb_policy header X-Session-Id`

## Out of scope

- Slot pinning within `LLAMA_PARALLEL` (requires llama-server changes, not ours to make)
- Anthropic prompt caching (handled by their API, not affected by this routing issue)
- System prompt reorder for cache optimization (separate issue #608)
