feat(providers): session-sticky LLM routing via X-Session-Id header #610
Merged
Self-hosted inference servers behind a load balancer (e.g., Caddy round-robin across multiple GPUs) defeat KV cache reuse when consecutive requests from the same session land on different backends. Every turn and every tool-call follow-up pays the full prompt-processing cost, as if it were a cold start.
Add a DelegatingHandler on the HttpClient pipeline that promotes an ambient session ID to an X-Session-Id HTTP header. The load balancer can hash on this header to pin same-session requests to the same backend GPU.
The ambient context is set in SessionLlmInvoker (not the actor), so sidecar calls (compaction, title generation, memory extraction) that bypass the invoker naturally omit the header and round-robin across backends, avoiding KV cache slot contention with the main session.
Closes #609
12b70b2 to 574e6ef
This was referenced Apr 12, 2026
Summary
Self-hosted inference servers behind a load balancer (e.g., Caddy round-robin across multiple GPUs) defeat KV cache reuse when consecutive requests from the same session land on different backends. Every turn and every tool-call follow-up pays the full prompt-processing cost of a cold start.
- Add a DelegatingHandler on the HttpClient pipeline that promotes an ambient session ID to an X-Session-Id HTTP header
- The load balancer can hash on this header (e.g., lb_policy header X-Session-Id in Caddy) to pin same-session requests to the same backend GPU
- The handler is wired in through ProviderPluginBase.CreateLlmHttpClient()
- Sidecar calls (compaction, title generation, memory extraction) bypass SessionLlmInvoker and naturally omit the header, round-robining across backends without competing for the main session's KV cache slot
Architecture
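The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type names LlmSessionContext, SessionIdHeaderHandler, RecordingHandler and StickyRoutingDemo are hypothetical, and the AsyncLocal-backed holder is an assumption based on the test plan's mention of AsyncLocal. The stub inner handler is there only so the sketch runs without a network.

```csharp
#nullable enable
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical ambient holder; the PR does not name this type.
public static class LlmSessionContext
{
    private static readonly AsyncLocal<string?> Current = new();

    public static string? SessionId => Current.Value;

    // Sets the ambient session ID and returns a scope that restores the previous value.
    public static IDisposable Begin(string sessionId)
    {
        var previous = Current.Value;
        Current.Value = sessionId;
        return new Scope(previous);
    }

    private sealed class Scope : IDisposable
    {
        private readonly string? _previous;
        public Scope(string? previous) => _previous = previous;
        public void Dispose() => Current.Value = _previous;
    }
}

// Promotes the ambient session ID, if any, to an X-Session-Id request header.
public sealed class SessionIdHeaderHandler : DelegatingHandler
{
    public const string HeaderName = "X-Session-Id";

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var sessionId = LlmSessionContext.SessionId;
        if (!string.IsNullOrEmpty(sessionId))
        {
            request.Headers.Remove(HeaderName); // avoid duplicate values on retries
            request.Headers.TryAddWithoutValidation(HeaderName, sessionId);
        }
        return base.SendAsync(request, cancellationToken);
    }
}

// Stub inner handler: records the header it sees and returns 200 without any I/O.
public sealed class RecordingHandler : HttpMessageHandler
{
    public string? LastSessionId { get; private set; }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        LastSessionId = request.Headers.TryGetValues(SessionIdHeaderHandler.HeaderName, out var values)
            ? string.Join(",", values)
            : null;
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}

public static class StickyRoutingDemo
{
    // Returns the header value observed inside and outside a session scope.
    public static async Task<(string? Inside, string? Outside)> RunAsync()
    {
        var recorder = new RecordingHandler();
        using var client = new HttpClient(new SessionIdHeaderHandler { InnerHandler = recorder });

        string? inside;
        using (LlmSessionContext.Begin("session-123")) // stands in for the main session call path
        {
            await client.GetAsync("http://llm.invalid/v1/chat/completions");
            inside = recorder.LastSessionId;
        }

        // Stands in for a sidecar call: no scope, so no header.
        await client.GetAsync("http://llm.invalid/v1/chat/completions");
        return (inside, recorder.LastSessionId);
    }
}
```

Keeping the session ID ambient means the handler needs no per-request plumbing: whether the header appears depends only on whether the calling path opened a scope.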
Impact on managed providers
None. Anthropic, OpenAI, and OpenRouter handle routing server-side. The header is harmless — these APIs ignore unknown request headers. The feature only matters for self-hosted deployments (llama-server, vLLM, etc.) behind a load balancer.
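For such a self-hosted deployment, the load-balancer side can be a short config fragment. The Caddyfile sketch below assumes two hypothetical backends (gpu0:8080 and gpu1:8080) and a hypothetical listen port; only the lb_policy header X-Session-Id directive comes from this PR.

```
:8000 {
	reverse_proxy gpu0:8080 gpu1:8080 {
		# Hash the X-Session-Id value so one session keeps hitting one upstream.
		lb_policy header X-Session-Id
	}
}
```

How Caddy selects an upstream for requests that lack the header depends on its fallback behavior for this policy, so it is worth checking the reverse_proxy documentation for the Caddy version in use.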
Closes #609
Test plan
- Handler_adds_header_when_context_is_set: header present with correct value
- Handler_omits_header_when_context_is_null: no header for sidecar calls
- Context_flows_through_async_boundary: AsyncLocal survives Task.Yield
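The async-boundary case can be illustrated in isolation. The sketch below is not the PR's test code; the class and method names are hypothetical, and it shows only standard AsyncLocal behavior: a value set before an await is still visible after the continuation resumes, while a value set inside an async callee does not leak back into its caller.

```csharp
#nullable enable
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLocalFlow
{
    private static readonly AsyncLocal<string?> Ambient = new();

    // The value set before the await is still visible after Task.Yield resumes.
    public static async Task<string?> ReadAfterYieldAsync(string value)
    {
        Ambient.Value = value;
        await Task.Yield();
        return Ambient.Value;
    }

    // The caller's view is unchanged by an assignment made inside an async callee.
    public static async Task<string?> ReadInCallerAsync()
    {
        Ambient.Value = null;
        await SetInCalleeAsync("callee-only");
        return Ambient.Value;
    }

    private static async Task SetInCalleeAsync(string value)
    {
        Ambient.Value = value;
        await Task.Yield();
    }
}
```

The second method hints at why the place where the ambient value is set matters: it is visible only to code that runs downstream of the assignment.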