Implement RoutingStorage (LRU + Redis fallback) #4210

@yrobla

Description

Create pkg/transport/session/storage_routing.go implementing RoutingStorage, a two-tier session storage that checks an LRU-bounded in-memory cache first and falls back to a Redis-backed Storage on cache misses. RoutingStorage implements the session.Storage interface and is the core building block for session-sticky routing across multiple proxyrunner replicas.

Context

Part of epic #263 — Horizontal Scaling for Proxyrunner (THV-0047). When the proxyrunner runs as multiple StatefulSet replicas, each replica must be able to resolve any session to its originating backend pod regardless of which replica was used during the MCP initialize request. RoutingStorage provides this guarantee by persisting the session_id → backend_url binding to Redis (delivered by the vMCP epic, RC-6/RC-7), while keeping a bounded local LRU cache to minimize hot-path Redis round-trips.

The BackendReplicas and SessionCacheSize fields added to RunConfig in TASK-001 (#265) are the configuration source: SessionCacheSize sets the LRU capacity passed to NewRoutingStorage; the downstream wiring tasks (TASK-005 and TASK-006) inject a RoutingStorage instance into the proxy Manager constructors.

Dependencies: #265 (TASK-001 — Extend RunConfig with replica and cache-size fields)
Blocks: TASK-005 (Session-aware backend routing in proxy transports), TASK-006 (Wire LRU cache-size bound in proxy transports)

Acceptance Criteria

  • pkg/transport/session/storage_routing.go is created in the session package with a file-level copyright header matching the project style (// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc. / // SPDX-License-Identifier: Apache-2.0)
  • RoutingStorage's struct fields are unexported; the exported surface is the RoutingStorage type, NewRoutingStorage, and the Storage interface methods
  • NewRoutingStorage(maxLocalEntries int, remote Storage) *RoutingStorage panics when maxLocalEntries <= 0 (alternatively, the constructor may return an error instead); a zero value is the caller's responsibility: the caller applies the default of 1000 before calling this constructor
  • Store writes the session to remote first, then promotes the entry into the local LRU cache; if the remote write fails the local cache is not updated and the error is returned
  • Load checks the local LRU cache first; on a hit it returns the cached session without contacting remote; on a miss it fetches from remote, promotes the result into the local cache, and returns it; ErrSessionNotFound from remote propagates unchanged
  • Delete removes the entry from both the local LRU cache and remote; remote.Delete is called even if the local cache did not contain the entry; errors from remote.Delete are returned to the caller
  • DeleteExpired delegates entirely to remote.DeleteExpired; after the remote call completes without error, either evict any local LRU entries whose session ID is no longer present in remote (i.e., a remote Load would return ErrSessionNotFound), or leave the local cache unmodified. Either approach is acceptable, since stale local entries age out of the LRU and are re-fetched from remote on the next miss
  • Close calls remote.Close(), clears the local LRU cache, and returns any error from remote.Close()
  • All five methods satisfy the session.Storage interface (Store, Load, Delete, DeleteExpired, Close)
  • The type assertion var _ Storage = (*RoutingStorage)(nil) compiles without error (add as a compile-time check in the source file)
  • Unit tests in pkg/transport/session/storage_routing_test.go cover the scenarios listed in the Testing Strategy section below
  • All existing tests in pkg/transport/session/ continue to pass without modification
  • go vet ./pkg/transport/session/... reports no issues
  • The LRU implementation uses an existing dependency (e.g. github.com/hashicorp/golang-lru/v2) already present in go.sum; adding a new direct dependency requires team review before merging

Technical Approach

Recommended Implementation

Create a new file pkg/transport/session/storage_routing.go. The struct holds two fields: an LRU cache (keyed by string, valued as Session) and a remote Storage. The LRU cache is guarded by the package's existing concurrency conventions — use a thread-safe LRU from hashicorp/golang-lru/v2 (already present in go.sum as a transitive dependency) to avoid a manual mutex. For DeleteExpired, delegating entirely to remote is the correct behavior because Redis owns the TTL source of truth; the local cache will self-correct as sessions expire and are re-fetched on miss. The remote Storage implementation is provided by the vMCP epic (RC-6/RC-7) and injected at construction time — RoutingStorage has no direct Redis client dependency.

Patterns & Frameworks

  • Follow LocalStorage in pkg/transport/session/storage_local.go exactly for method signatures, doc comments, and error variable reuse (ErrSessionNotFound from errors.go)
  • Use github.com/hashicorp/golang-lru/v2 for the thread-safe LRU cache; prefer lru.New[string, Session](maxLocalEntries) from that package — the generic API avoids type assertions
  • The remote Storage field follows the same constructor-injection pattern used throughout ToolHive (compare with session.NewManagerWithStorage in manager.go:104)
  • Add a compile-time interface check var _ Storage = (*RoutingStorage)(nil) at the top of the file, consistent with other type assertions in the codebase
  • File-level SPDX header and package session declaration must be present; follow the existing file header in storage_local.go exactly

Code Pointers

  • pkg/transport/session/storage_local.go — Primary reference for implementing Storage; RoutingStorage follows the same method signatures, doc comment style, and error handling patterns; the Load logic (check → miss → return ErrSessionNotFound) is the template for the two-tier lookup
  • pkg/transport/session/storage.go — The Storage interface definition; all five methods must be satisfied
  • pkg/transport/session/errors.go: ErrSessionNotFound, ErrSessionAlreadyExists; reuse these sentinel errors rather than defining new ones
  • pkg/transport/session/manager.go:104 (NewManagerWithStorage) shows how a custom Storage is injected; RoutingStorage is the Storage argument passed here by TASK-005
  • pkg/transport/session/storage_test.go — Test patterns: t.Parallel(), testify/assert + testify/require, context creation, and ErrSessionNotFound assertion style to follow in the new test file
  • pkg/runner/config.go (lines 46–213, updated by #265): RunConfig.SessionCacheSize is the configuration source; TASK-006 propagates it to NewRoutingStorage

Component Interfaces

// RoutingStorage is an LRU-bounded, two-tier session storage.
// Local LRU cache is checked first; on miss, falls back to the Redis Storage.
// Evicted local entries are transparently recovered from Redis on the next Load.
type RoutingStorage struct {
    local  *lru.Cache[string, Session] // thread-safe LRU from hashicorp/golang-lru/v2
    remote Storage
}

// NewRoutingStorage constructs a RoutingStorage.
//   maxLocalEntries: LRU cap; must be > 0
//   remote:          Redis-backed Storage from the vMCP epic (RC-6/RC-7)
func NewRoutingStorage(maxLocalEntries int, remote Storage) *RoutingStorage

// Storage interface methods — all five must be implemented:
func (r *RoutingStorage) Store(ctx context.Context, session Session) error
func (r *RoutingStorage) Load(ctx context.Context, id string) (Session, error)
func (r *RoutingStorage) Delete(ctx context.Context, id string) error
func (r *RoutingStorage) DeleteExpired(ctx context.Context, before time.Time) error
func (r *RoutingStorage) Close() error

// Compile-time interface guard (add near the top of the file):
var _ Storage = (*RoutingStorage)(nil)

Session-to-backend-URL metadata convention (shared with TASK-005):

"backend_pod" → pod name, e.g. "mcp-server-0"
"backend_url" → full pod URL, e.g. "http://mcp-server-0.mcp-server.default.svc:8080"

These keys are written into the session's metadata map via session.SetMetadata before the session is passed to Store. RoutingStorage does not parse or validate these keys — it stores and retrieves the Session object opaquely.
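As an illustration of the convention above, the following sketch writes both keys before a session would be handed to Store. The fakeSession type, its SetMetadata and GetMetadata signatures, the constant names, and the bindBackend helper are all hypothetical; the real Session interface in the session package may differ.

```go
package main

import "fmt"

// Metadata keys from the convention shared with TASK-005.
// The constant names are illustrative, not taken from the codebase.
const (
	MetadataKeyBackendPod = "backend_pod"
	MetadataKeyBackendURL = "backend_url"
)

// fakeSession is a hypothetical stand-in for a Session with a metadata map.
type fakeSession struct {
	id   string
	meta map[string]string
}

func (s *fakeSession) SetMetadata(key, value string)  { s.meta[key] = value }
func (s *fakeSession) GetMetadata() map[string]string { return s.meta }

// bindBackend records which backend pod owns the session. The storage
// layer treats these values as opaque and never parses them.
func bindBackend(s *fakeSession, pod, url string) {
	s.SetMetadata(MetadataKeyBackendPod, pod)
	s.SetMetadata(MetadataKeyBackendURL, url)
}

func main() {
	s := &fakeSession{id: "sess-1", meta: map[string]string{}}
	bindBackend(s, "mcp-server-0", "http://mcp-server-0.mcp-server.default.svc:8080")
	fmt.Println(s.GetMetadata()[MetadataKeyBackendPod])
	fmt.Println(s.GetMetadata()[MetadataKeyBackendURL])
}
```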

Testing Strategy

Create pkg/transport/session/storage_routing_test.go in package session. Use t.Parallel() on all top-level and sub-tests. Use testify/assert and testify/require following the existing style in storage_test.go. For the remote dependency, use NewLocalStorage() as a test double (it already satisfies Storage).

Unit Tests

  • Store then Load: store a session via RoutingStorage; verify the session is returned by Load without contacting remote (use a counter-wrapped Storage to assert zero remote Load calls after a local hit)
  • Load cache miss: create a RoutingStorage with an empty local cache but a populated remote; call Load; verify the session is returned and subsequently served from the local cache (no second remote call)
  • Load not found: call Load for an unknown session ID; assert ErrSessionNotFound is returned
  • Delete removes from both tiers: store a session; call Delete; verify Load returns ErrSessionNotFound and the remote storage also no longer contains the session
  • Delete non-existent: call Delete for an unknown ID; assert no error is returned (consistent with LocalStorage behavior)
  • Store remote failure: use a stub Storage that returns an error on Store; verify the local cache is not populated and the error is propagated
  • LRU eviction + recovery: construct a RoutingStorage with maxLocalEntries=2; store three sessions; verify the first session is no longer in the local cache but can still be recovered via Load (fetched from remote)
  • DeleteExpired delegates to remote: verify that expired sessions written to the underlying remote storage are cleaned up after DeleteExpired is called on RoutingStorage
  • Close closes remote: assert that remote.Close() is called; verify subsequent Load calls return errors consistent with closed storage

Integration Tests

  • No integration tests required for this task; the Redis-backed remote Storage is tested in the vMCP epic (RC-6/RC-7) and not available in this package

Edge Cases

  • maxLocalEntries <= 0 passed to NewRoutingStorage — assert a panic or non-nil error is returned depending on the chosen constructor contract
  • Concurrent Store and Load calls from multiple goroutines — run under -race to verify the thread-safe LRU prevents data races
  • Load after remote Store (bypass local cache): populate the remote directly and verify RoutingStorage.Load fetches and caches the session correctly

Out of Scope

  • Redis Storage implementation (RC-6/RC-7) — delivered by the vMCP epic; RoutingStorage accepts it as an injected Storage interface value
  • Wiring RoutingStorage into the proxy transports (TASK-005)
  • Propagating SessionCacheSize to NewRoutingStorage in the proxy constructors (TASK-006)
  • Load balancing / round-robin selection of backend pods on initialize — handled in TASK-005
  • Per-session backend URL assignment logic — handled in TASK-005
  • Changes to StatefulSet replica count wiring (TASK-002) or graceful shutdown (TASK-003)
