
feat(providers): session-sticky LLM routing via X-Session-Id header #610

Merged
Aaronontheweb merged 1 commit into dev from session-sticky-routing on Apr 12, 2026

@Aaronontheweb

Collaborator

Summary

Self-hosted inference servers behind a load balancer (e.g., Caddy round-robin across multiple GPUs) defeat KV cache reuse when consecutive requests from the same session land on different backends. Every turn and every tool-call follow-up pays the full prompt-processing cost, as if it were a cold start.

  • Adds a DelegatingHandler on the HttpClient pipeline that promotes an ambient session ID to an X-Session-Id HTTP header
  • The load balancer can hash on this header (lb_policy header X-Session-Id in Caddy) to pin same-session requests to the same backend GPU
  • Works for all providers automatically — wired into ProviderPluginBase.CreateLlmHttpClient()
  • Only main-model calls get the header; sidecar calls (compaction, title gen, memory extraction) bypass SessionLlmInvoker and naturally omit the header, round-robining across backends without competing for the main session's KV cache slot
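
For readers who want to picture the handler described in the bullets above, here is a minimal sketch. It assumes the ambient context is a static wrapper around `AsyncLocal<string?>`; the member names (`SessionId`, `HeaderName`) are hypothetical and may not match the actual code in this PR.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape of the ambient context; the real type lives in
// Netclaw.Configuration and may expose a different API.
public static class SessionAffinityContext
{
    private static readonly AsyncLocal<string?> Current = new();

    public static string? SessionId
    {
        get => Current.Value;
        set => Current.Value = value;
    }
}

// Copies the ambient session ID, if any, onto each outgoing request.
public sealed class SessionAffinityHandler : DelegatingHandler
{
    public const string HeaderName = "X-Session-Id"; // header name from the PR

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var sessionId = SessionAffinityContext.SessionId;
        if (!string.IsNullOrEmpty(sessionId))
        {
            // Replace rather than append, so a retried request does not
            // end up carrying the header twice.
            request.Headers.Remove(HeaderName);
            request.Headers.TryAddWithoutValidation(HeaderName, sessionId);
        }

        // No ambient session (e.g., a sidecar call): send the request untouched.
        return base.SendAsync(request, cancellationToken);
    }
}
```

A client built as `new HttpClient(new SessionAffinityHandler { InnerHandler = new HttpClientHandler() })` would then send the header only while `SessionAffinityContext.SessionId` is set on the calling async flow.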

Architecture

SessionAffinityContext (AsyncLocal<string?>) — ambient context in Netclaw.Configuration
        ↓ set by SessionLlmInvoker.InvokeAsync()
SessionAffinityHandler (DelegatingHandler) — reads context, adds X-Session-Id header
        ↓ wired into every HttpClient via ProviderPluginBase
Load Balancer (Caddy) — hashes on X-Session-Id → same GPU
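
On the load balancer side, a Caddyfile fragment along these lines would hash on the header. The listen address and backend hosts are placeholders, not values from this PR, and the `fallback` option assumes a Caddy version whose `header` policy supports it.

```
# Hypothetical listen address and backend hosts
:8080 {
	reverse_proxy gpu-0:8000 gpu-1:8000 {
		# Requests carrying X-Session-Id are pinned by hashing the header value
		lb_policy header X-Session-Id {
			# Requests without the header (sidecar calls) are spread round-robin
			fallback round_robin
		}
	}
}
```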

Impact on managed providers

None. Anthropic, OpenAI, and OpenRouter handle routing server-side. The header is harmless — these APIs ignore unknown request headers. The feature only matters for self-hosted deployments (llama-server, vLLM, etc.) behind a load balancer.

Closes #609

Test plan

  • Handler_adds_header_when_context_is_set — header present with correct value
  • Handler_omits_header_when_context_is_null — no header for sidecar calls
  • Context_flows_through_async_boundary — AsyncLocal survives Task.Yield
  • Full actor test suite passes (971 tests)
  • Slopwatch clean
  • CI passes
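
The async-boundary test relies on standard `AsyncLocal<T>` behavior, which the standalone sketch below illustrates. It does not use the PR's types; `AsyncLocalFlowDemo` and its methods are hypothetical names chosen for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLocalFlowDemo
{
    private static readonly AsyncLocal<string?> SessionId = new();

    // A value set before an await is still visible after the continuation resumes.
    public static async Task<string?> ReadAfterYieldAsync(string id)
    {
        SessionId.Value = id;
        await Task.Yield();
        return SessionId.Value;
    }

    // A value set inside a child async method does not leak back to its caller,
    // because the execution context is restored when the async method returns.
    public static async Task<string?> ReadInCallerAfterChildSetsAsync()
    {
        SessionId.Value = null;
        await SetInChildAsync("child-session");
        return SessionId.Value;
    }

    private static async Task SetInChildAsync(string id)
    {
        SessionId.Value = id;
        await Task.Yield();
    }

    public static async Task Main()
    {
        Console.WriteLine(await ReadAfterYieldAsync("session-42"));          // session-42
        Console.WriteLine(await ReadInCallerAfterChildSetsAsync() is null);  // True
    }
}
```

The second method shows why an ID set inside one async call is not visible to unrelated calls made later from the caller's flow.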

Self-hosted inference servers behind a load balancer (e.g., Caddy
round-robin across multiple GPUs) defeat KV cache reuse when
consecutive requests from the same session land on different backends.
Every turn and every tool-call follow-up pays full prompt processing
cost as if it were a cold start.

Add a DelegatingHandler on the HttpClient pipeline that promotes an
ambient session ID to an X-Session-Id HTTP header. The load balancer
can hash on this header to pin same-session requests to the same
backend GPU.

The ambient context is set in SessionLlmInvoker (not the actor) so
sidecar calls (compaction, title generation, memory extraction) that
bypass the invoker naturally omit the header and round-robin across
backends — avoiding KV cache slot contention with the main session.

Closes #609

Labels

sessions: LLM session actor, turn lifecycle, pipelines


Development

Successfully merging this pull request may close these issues.

Session-sticky LLM routing: add X-Session-Id header for load balancer affinity
