feat(providers): session-sticky LLM routing via X-Session-Id header by Aaronontheweb · Pull Request #610 · netclaw-dev/netclaw

Aaronontheweb · 2026-04-12T13:06:45Z

Summary

Self-hosted inference servers behind a load balancer (e.g., Caddy round-robin across multiple GPUs) defeat KV cache reuse when consecutive requests from the same session land on different backends. Every turn and every tool-call follow-up pays the full prompt processing cost as if it were a cold start.

Adds a DelegatingHandler on the HttpClient pipeline that promotes an ambient session ID to an X-Session-Id HTTP header
The load balancer can hash on this header (lb_policy header X-Session-Id in Caddy) to pin same-session requests to the same backend GPU
Works for all providers automatically — wired into ProviderPluginBase.CreateLlmHttpClient()
Only main-model calls get the header; sidecar calls (compaction, title gen, memory extraction) bypass SessionLlmInvoker and naturally omit the header, round-robining across backends without competing for the main session's KV cache slot
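
For illustration, the Caddy side of this could look roughly like the fragment below. It is a sketch under assumptions, not configuration taken from this PR: the listen port and the gpu-0/gpu-1 upstream addresses are hypothetical, and the fallback sub-option assumes a Caddy release recent enough to support it (without it, Caddy picks its default fallback policy for requests that lack the header).

```
# Hypothetical listener and upstreams; replace with the real backends.
:8080 {
	reverse_proxy gpu-0:8000 gpu-1:8000 {
		# Hash on X-Session-Id so one session keeps hitting one backend.
		lb_policy header X-Session-Id {
			# Requests without the header (sidecar calls) are spread round-robin.
			fallback round_robin
		}
	}
}
```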

Architecture

SessionAffinityContext (AsyncLocal<string?>) — ambient context in Netclaw.Configuration
        ↓ set by SessionLlmInvoker.InvokeAsync()
SessionAffinityHandler (DelegatingHandler) — reads context, adds X-Session-Id header
        ↓ wired into every HttpClient via ProviderPluginBase
Load Balancer (Caddy) — hashes on X-Session-Id → same GPU
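
The first two layers of the diagram can be sketched as follows. This is a minimal sketch and not the code merged in this PR: the type names SessionAffinityContext and SessionAffinityHandler come from the description above, while the Begin method, the Scope class, and the HeaderName constant are hypothetical names introduced for illustration.

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Ambient session ID that flows with the async call chain.
// Hypothetical shape: the PR's real members may differ.
public static class SessionAffinityContext
{
    private static readonly AsyncLocal<string?> Current = new();

    // Null when no session scope is active (e.g. sidecar calls).
    public static string? SessionId => Current.Value;

    // Sets the ID for the current async flow; disposing restores the previous value.
    public static IDisposable Begin(string sessionId)
    {
        var previous = Current.Value;
        Current.Value = sessionId;
        return new Scope(previous);
    }

    private sealed class Scope : IDisposable
    {
        private readonly string? _previous;
        public Scope(string? previous) => _previous = previous;
        public void Dispose() => Current.Value = _previous;
    }
}

// Copies the ambient session ID onto each outgoing request as X-Session-Id.
public sealed class SessionAffinityHandler : DelegatingHandler
{
    public const string HeaderName = "X-Session-Id";

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var sessionId = SessionAffinityContext.SessionId;
        if (!string.IsNullOrEmpty(sessionId))
        {
            // Replace any existing value so the header is never duplicated.
            request.Headers.Remove(HeaderName);
            request.Headers.TryAddWithoutValidation(HeaderName, sessionId);
        }

        // With no ambient ID the request passes through unchanged.
        return base.SendAsync(request, cancellationToken);
    }
}
```

Under these assumptions, a caller in the position of SessionLlmInvoker.InvokeAsync() would wrap the model call in `using (SessionAffinityContext.Begin(sessionId)) { ... }`, and any HttpClient built with SessionAffinityHandler in its pipeline would pick the ID up without the session ID being passed through the provider code.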

Impact on managed providers

None. Anthropic, OpenAI, and OpenRouter handle routing server-side. The header is harmless — these APIs ignore unknown request headers. The feature only matters for self-hosted deployments (llama-server, vLLM, etc.) behind a load balancer.

Closes #609

Test plan

Handler_adds_header_when_context_is_set — header present with correct value
Handler_omits_header_when_context_is_null — no header for sidecar calls
Context_flows_through_async_boundary — AsyncLocal survives Task.Yield
Full actor test suite passes (971 tests)
Slopwatch clean
CI passes

Self-hosted inference servers behind a load balancer (e.g., Caddy round-robin across multiple GPUs) defeat KV cache reuse when consecutive requests from the same session land on different backends. Every turn and every tool-call follow-up pays full prompt processing cost as if it were a cold start.

Add a DelegatingHandler on the HttpClient pipeline that promotes an ambient session ID to an X-Session-Id HTTP header. The load balancer can hash on this header to pin same-session requests to the same backend GPU.

The ambient context is set in SessionLlmInvoker (not the actor) so sidecar calls (compaction, title generation, memory extraction) that bypass the invoker naturally omit the header and round-robin across backends, avoiding KV cache slot contention with the main session.

Closes #609

Aaronontheweb force-pushed the session-sticky-routing branch from 12b70b2 to 574e6ef on April 12, 2026 13:17

Aaronontheweb added the sessions label (LLM session actor, turn lifecycle, pipelines) on Apr 12, 2026

Aaronontheweb merged commit fb9eedf into dev Apr 12, 2026
4 checks passed

Aaronontheweb deleted the session-sticky-routing branch April 12, 2026 13:26

Aaronontheweb merged 1 commit into dev from session-sticky-routing
