[FEATURE]: Session affinity for stateful MCP workflows (REQ-005) #1986

@crivetimihai

🔗 Feature: Session Affinity for Stateful MCP Workflows (REQ-005)

Goal

Bind all user interactions within a session to a single upstream MCP server instance. This enables stateful agentic workflows and elicitation flows where upstream servers maintain session state across multiple tool calls.

Why Now?

  1. Stateful Agents: AI agents increasingly need to maintain conversation state across tool calls
  2. Elicitation Routing: Elicitation requests must reach the correct user session
  3. Multi-Worker Deployments: Horizontal scaling breaks session locality without affinity
  4. MCP Server State: Some MCP servers store context that must persist across calls

📖 User Stories

US-1: AI Agent - Maintain Context Across Tool Calls

As an AI Agent executing multi-step workflows
I want all my tool calls routed to the same upstream MCP server
So that the server can maintain state between calls

Acceptance Criteria:

Scenario: Sequential tool calls maintain session affinity
  Given a user starts a session via SSE transport
  And the session is assigned to upstream MCP server "server-A"
  When the user calls tool "start_task" 
  Then the call should route to "server-A"
  When the user calls tool "check_status"
  Then the call should also route to "server-A"
  And both calls should see the same server-side state

Scenario: New session gets fresh assignment
  Given user A has affinity to "server-A"
  When user B starts a new session
  Then user B may be assigned to "server-A" or "server-B"
  And user B's assignment is independent of user A

Technical Requirements:

  • Create SessionAffinityService mapping downstream_session_id -> upstream_pool_key
  • Pin upstream session on first call
  • Reuse pinned session for all subsequent calls

US-2: Gateway - Route Elicitation to Correct Session

As a Gateway administrator
I want elicitation requests routed to the originating user session
So that the correct user receives interactive prompts

Acceptance Criteria:

Scenario: Elicitation reaches correct user
  Given user A is executing tool "confirm_delete" via session S1
  And user B has active session S2
  When the MCP server sends elicitation/create
  Then the request should route ONLY to session S1
  And user B (S2) should NOT see the elicitation

Scenario: Elicitation with session affinity
  Given session S1 has affinity to upstream server US1
  When US1 sends elicitation/create during tool execution
  Then the response from S1 should return to US1
  And the affinity should be maintained

Technical Requirements:

  • Integrate affinity service with elicitation routing
  • Use session mapping to route elicitation responses

US-3: Operator - Multi-Worker Session Affinity

As a Platform Operator running multiple gateway workers
I want session affinity to work across workers
So that load balancing doesn't break session state

Acceptance Criteria:

Scenario: Cross-worker session routing
  Given 3 gateway workers behind a load balancer
  And user session S1 was established on worker W1
  When a subsequent request for S1 hits worker W2
  Then W2 should look up the affinity in Redis
  And route the request to the correct upstream server
  And optionally redirect to W1 for optimal performance

Scenario: Worker failure recovery
  Given session S1 has affinity to upstream US1 via worker W1
  When W1 fails
  And a request for S1 arrives at W2
  Then W2 should re-establish affinity to US1
  And log the rebind event
  And continue processing requests

Technical Requirements:

  • Store affinity mappings in Redis for cross-worker visibility
  • Track downstream_session_id -> worker_id for worker affinity
  • Implement graceful rebind on worker failure

US-4: Developer - Enable/Disable Affinity Per Gateway

As a Developer configuring gateways
I want to enable session affinity for specific gateways
So that I can use it only where needed

Acceptance Criteria:

Scenario: Enable session affinity globally
  Given MCPGATEWAY_SESSION_AFFINITY_ENABLED=true
  When a new session is created
  Then session affinity should be tracked
  And all tool calls should maintain affinity

Scenario: Affinity disabled by default
  Given MCPGATEWAY_SESSION_AFFINITY_ENABLED=false (default)
  When tool calls are made
  Then requests may route to any available upstream session
  And no affinity tracking overhead occurs

Scenario: Configure affinity TTL
  Given MCPGATEWAY_SESSION_AFFINITY_TTL=3600
  When a session has no activity for 1 hour
  Then the affinity mapping should expire
  And the next request should get a fresh assignment

Technical Requirements:

  • Add MCPGATEWAY_SESSION_AFFINITY_ENABLED config (default: false)
  • Add MCPGATEWAY_SESSION_AFFINITY_TTL config (default: 3600s)
  • Opt-in behavior to avoid overhead for stateless use cases

🏗 Architecture

Session Affinity Flow

sequenceDiagram
    participant Client
    participant Gateway
    participant AffinityService
    participant Redis
    participant MCP Server

    Client->>Gateway: Connect (session: S1)
    Gateway->>AffinityService: Check affinity for S1
    AffinityService->>Redis: GET affinity:S1
    Redis-->>AffinityService: null (no affinity)
    
    Client->>Gateway: tools/call "start_task"
    Gateway->>AffinityService: Get/Create affinity for S1
    AffinityService->>MCP Server: Execute on US1
    AffinityService->>Redis: SET affinity:S1 = US1 (TTL: 3600)
    MCP Server-->>Gateway: Result
    
    Client->>Gateway: tools/call "check_status"
    Gateway->>AffinityService: Get affinity for S1
    AffinityService->>Redis: GET affinity:S1
    Redis-->>AffinityService: US1
    AffinityService->>MCP Server: Execute on US1 (same server!)
    MCP Server-->>Gateway: Result (with state from first call)

Affinity Service Design

classDiagram
    class SessionAffinityService {
        -redis_client: Redis
        -local_cache: Dict
        -ttl: int
        +get_affinity(session_id) Optional~str~
        +set_affinity(session_id, upstream_key)
        +remove_affinity(session_id)
        +rebind_affinity(session_id, new_upstream)
    }
    
    class AffinityMapping {
        +downstream_session_id: str
        +upstream_pool_key: str
        +worker_id: Optional~str~
        +created_at: datetime
        +last_used: datetime
    }

📋 Implementation Tasks

Phase 1: Session ID Propagation

  • Add x-mcp-session-id to DEFAULT_IDENTITY_HEADERS
  • Inject session ID header in generate_response() for SSE
  • Inject session ID in streamable HTTP context vars
  • Pass session ID through tool invocation path
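
A minimal sketch of the propagation idea in Phase 1, assuming the session ID is held in a context variable for the duration of a request. The header name comes from the task list; `current_session_id` and `with_session_header` are hypothetical names, not existing gateway code.

```python
from contextvars import ContextVar
from typing import Dict, Optional

# Header name taken from the Phase 1 task list.
SESSION_HEADER = "x-mcp-session-id"

# Hypothetical context variable; the gateway's real variable may be named differently.
current_session_id: ContextVar[Optional[str]] = ContextVar(
    "current_session_id", default=None
)


def with_session_header(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of headers with the session ID added when one is set."""
    session_id = current_session_id.get()
    if session_id is None:
        return dict(headers)
    return {**headers, SESSION_HEADER: session_id}
```

A transport handler would call `current_session_id.set(...)` when a request arrives and `reset(...)` with the returned token when it finishes, so the tool invocation path can read the ID without threading it through every call signature.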

Phase 2: Affinity Service (In-Memory)

  • Create mcpgateway/services/session_affinity_service.py
  • Implement SessionAffinityService class
  • Add in-memory storage for single-worker deployments
  • Implement get/set/remove affinity methods
  • Add TTL-based expiration
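
One way the in-memory backend could look, as a sketch only. It assumes a sliding TTL (each successful lookup refreshes the expiry), which matches the "no activity for 1 hour" scenario in US-4. The class name `InMemoryAffinityStore` and the injectable `clock` parameter are illustrative.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryAffinityStore:
    """Maps downstream_session_id -> upstream_pool_key with a sliding TTL."""

    def __init__(
        self, ttl: float = 3600.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        # session_id -> (upstream_pool_key, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get_affinity(self, session_id: str) -> Optional[str]:
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        upstream_key, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            # Expired: drop the mapping so the next call gets a fresh assignment.
            del self._entries[session_id]
            return None
        # Assumption: activity extends the binding (sliding expiration).
        self._entries[session_id] = (upstream_key, now + self._ttl)
        return upstream_key

    def set_affinity(self, session_id: str, upstream_key: str) -> None:
        self._entries[session_id] = (upstream_key, self._clock() + self._ttl)

    def remove_affinity(self, session_id: str) -> None:
        self._entries.pop(session_id, None)
```

Expiry here is lazy: stale entries are only removed when looked up, so a periodic sweep would still be needed to bound memory for sessions that never return.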

Phase 3: Redis Backend

  • Add Redis storage backend for affinity mappings
  • Use Redis for cross-worker visibility
  • Implement atomic set-if-not-exists for initial binding
  • Add TTL support via Redis EXPIRE
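
The atomic initial binding could be sketched as follows. The function only relies on `set(name, value, nx=True, ex=ttl)` and `get(name)`, which redis-py clients provide; the `affinity:<session>` key format follows the sequence diagram. `bind_or_get` and `DictRedis` are hypothetical names, and `DictRedis` is an in-memory stand-in (it ignores the TTL) so the sketch runs without a server.

```python
from typing import Dict, Optional, Protocol, Union


class RedisLike(Protocol):
    """Subset of the redis-py client interface used by this sketch."""

    def set(self, name: str, value: str, *, nx: bool = False, ex: Optional[int] = None): ...

    def get(self, name: str): ...


def bind_or_get(client: RedisLike, session_id: str, candidate: str, ttl: int = 3600) -> str:
    """Bind session_id to candidate unless another worker bound it first."""
    key = f"affinity:{session_id}"
    for _ in range(2):
        # SET key value NX EX ttl succeeds only when no binding exists yet.
        if client.set(key, candidate, nx=True, ex=ttl):
            return candidate
        existing: Union[bytes, str, None] = client.get(key)
        if existing is not None:
            # redis-py returns bytes unless decode_responses=True was set.
            return existing.decode() if isinstance(existing, bytes) else existing
        # The binding expired between SET and GET; try once more.
    raise RuntimeError(f"could not bind affinity for session {session_id}")


class DictRedis:
    """In-memory stand-in for demonstration; not a real Redis client."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, name: str, value: str, *, nx: bool = False, ex: Optional[int] = None):
        if nx and name in self._data:
            return None
        self._data[name] = value
        return True

    def get(self, name: str):
        return self._data.get(name)
```

Setting the TTL in the same `SET` call avoids a window where a key exists without an expiry, which a separate `EXPIRE` call after `SETNX` would leave open.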

Phase 4: Tool Service Integration

  • Modify tool invocation to check affinity first
  • Create affinity on first tool call
  • Reuse affinity for subsequent calls
  • Handle affinity miss (upstream unavailable)
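
The lookup order in Phase 4 can be sketched as a single resolution function. This is an assumption about how the pieces fit together, not the tool service's actual code: `resolve_upstream`, the plain-dict `bindings` store, and the `choose` callback are all illustrative.

```python
from typing import Callable, Dict, Iterable, List


def resolve_upstream(
    session_id: str,
    bindings: Dict[str, str],
    healthy: Iterable[str],
    choose: Callable[[List[str]], str],
) -> str:
    """Return the upstream for this session, pinning one on first use."""
    healthy_list = list(healthy)
    if not healthy_list:
        raise RuntimeError("no healthy upstream available")
    pinned = bindings.get(session_id)
    if pinned in healthy_list:
        # Affinity hit: reuse the pinned upstream.
        return pinned
    # Affinity miss: first call for this session, or the pinned upstream
    # is no longer available. Pick a new upstream and record the binding.
    chosen = choose(healthy_list)
    bindings[session_id] = chosen
    return chosen
```

The sketch only rebinds when the pinned upstream is unavailable; carrying server-side context over to the new upstream, as the design decisions call for, is out of scope here.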

Phase 5: Elicitation Integration

  • Use affinity service in elicitation routing
  • Ensure elicitation responses maintain affinity
  • Add cleanup hooks in SessionRegistry.remove_session()
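
A sketch of elicitation routing with cleanup, under one loudly stated assumption: an `elicitation/create` from the upstream can be correlated with the in-flight tool call that triggered it through some request identifier. `ElicitationRouter` and its method names are hypothetical, and the real correlation key may differ.

```python
from typing import Dict, Optional


class ElicitationRouter:
    """Tracks which downstream session owns each in-flight tool call."""

    def __init__(self) -> None:
        # in-flight request id -> downstream session id
        self._owner: Dict[str, str] = {}

    def begin_tool_call(self, request_id: str, session_id: str) -> None:
        self._owner[request_id] = session_id

    def end_tool_call(self, request_id: str) -> None:
        self._owner.pop(request_id, None)

    def route_elicitation(self, request_id: str) -> Optional[str]:
        """Return the only session that may receive this elicitation."""
        return self._owner.get(request_id)

    def remove_session(self, session_id: str) -> None:
        """Cleanup hook: drop every in-flight entry owned by the session."""
        stale = [rid for rid, sid in self._owner.items() if sid == session_id]
        for rid in stale:
            del self._owner[rid]
```

Returning `None` for an unknown request lets the caller reject the elicitation rather than broadcast it, which is what keeps user B from seeing user A's prompt.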

Phase 6: Worker Affinity

  • Track session_id -> worker_id mapping
  • Add worker health checking
  • Implement graceful rebind on worker failure
  • Add rebind logging and metrics
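
Worker tracking with rebind on failure might look like the sketch below. It assumes liveness is judged by heartbeat age; the 30-second timeout, the `WorkerAffinity` class, and the `rebinds` counter are illustrative choices, and a multi-worker deployment would keep this state in Redis rather than in process memory.

```python
import logging
import time
from typing import Callable, Dict

logger = logging.getLogger("session_affinity")


class WorkerAffinity:
    """Tracks session_id -> worker_id and rebinds when the owner looks dead."""

    def __init__(
        self, heartbeat_timeout: float = 30.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._timeout = heartbeat_timeout
        self._clock = clock
        self._owners: Dict[str, str] = {}        # session_id -> worker_id
        self._heartbeats: Dict[str, float] = {}  # worker_id -> last heartbeat time
        self.rebinds = 0

    def heartbeat(self, worker_id: str) -> None:
        self._heartbeats[worker_id] = self._clock()

    def is_alive(self, worker_id: str) -> bool:
        last = self._heartbeats.get(worker_id)
        return last is not None and self._clock() - last <= self._timeout

    def owner_for(self, session_id: str, requesting_worker: str) -> str:
        """Return the owning worker, taking over if the owner is stale."""
        owner = self._owners.get(session_id)
        if owner is not None and self.is_alive(owner):
            return owner
        if owner is not None:
            # The previous owner stopped sending heartbeats: log the rebind.
            logger.warning(
                "rebinding session %s from worker %s to %s",
                session_id, owner, requesting_worker,
            )
            self.rebinds += 1
        self._owners[session_id] = requesting_worker
        return requesting_worker
```

The `rebinds` attribute stands in for the `session_affinity_rebinds_total` counter listed under Phase 7.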

Phase 7: Metrics

  • Add session_affinity_bindings_active gauge
  • Add session_affinity_hits_total counter
  • Add session_affinity_misses_total counter
  • Add session_affinity_rebinds_total counter
  • Add session_affinity_failures_total counter

Phase 8: Testing

  • Unit tests for affinity service
  • Unit tests for Redis backend
  • Integration tests for multi-call affinity
  • Integration tests for elicitation routing
  • Integration tests for worker failover

⚙️ Configuration Example

# Session affinity is opt-in; the default leaves it disabled
MCPGATEWAY_SESSION_AFFINITY_ENABLED=false

# TTL for affinity mappings (seconds)
MCPGATEWAY_SESSION_AFFINITY_TTL=3600

# Redis required for multi-worker affinity
REDIS_URL=redis://localhost:6379

# Example: Enable for stateful workflows
# MCPGATEWAY_SESSION_AFFINITY_ENABLED=true
# MCPGATEWAY_SESSION_AFFINITY_TTL=7200
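
For illustration, the two affinity variables could be parsed as shown below. This is a standard-library sketch, not the gateway's settings code, which may well use a settings library instead. `AffinitySettings` and `load_affinity_settings` are hypothetical names, and the set of accepted truthy strings is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AffinitySettings:
    enabled: bool = False      # opt-in, matching the documented default
    ttl_seconds: int = 3600    # documented default TTL


def load_affinity_settings(env: Mapping[str, str] = os.environ) -> AffinitySettings:
    """Read the affinity variables, falling back to the documented defaults."""
    raw_enabled = env.get("MCPGATEWAY_SESSION_AFFINITY_ENABLED", "false")
    raw_ttl = env.get("MCPGATEWAY_SESSION_AFFINITY_TTL", "3600")
    ttl = int(raw_ttl)
    if ttl <= 0:
        raise ValueError("MCPGATEWAY_SESSION_AFFINITY_TTL must be positive")
    # Assumption: these strings count as "true"; anything else disables affinity.
    enabled = raw_enabled.strip().lower() in {"1", "true", "yes", "on"}
    return AffinitySettings(enabled=enabled, ttl_seconds=ttl)
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.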

✅ Success Criteria

  • Downstream SSE session maintains affinity across multiple tool calls
  • Downstream streamable HTTP session maintains affinity
  • Elicitation requests route to originating user session
  • Multi-worker deployments maintain affinity via Redis
  • Graceful rebind on upstream session failure
  • Metrics exposed for monitoring affinity behavior
  • Configuration toggles work correctly
  • No performance regression when affinity disabled
  • All integration tests pass

🏁 Definition of Done

  • Session ID propagation implemented
  • Affinity service with in-memory backend working
  • Redis backend implemented
  • Tool service integration complete
  • Elicitation integration complete
  • Worker affinity implemented
  • Metrics added and exposed
  • Unit tests written and passing
  • Integration tests written and passing
  • Code passes make verify
  • Configuration documented in .env.example
  • PR reviewed and approved

📝 Additional Notes

Design Decisions

Decision        | Resolution                 | Rationale
Affinity scope  | Per session (not per user) | Users may have multiple sessions
Concurrency     | Serialize by default       | Prevents race conditions in stateful servers
Rebind strategy | Log, carry over context    | Graceful degradation preferred
Default state   | Opt-in via config          | Avoid overhead for stateless use cases
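
The "serialize by default" decision can be illustrated with one lock per session, so calls within a session run one at a time while different sessions proceed concurrently. This is a sketch under the assumption that the gateway uses asyncio; `SessionSerializer` and `demo` are hypothetical names.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, DefaultDict, List, TypeVar

T = TypeVar("T")


class SessionSerializer:
    """Runs at most one call at a time per downstream session."""

    def __init__(self) -> None:
        self._locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def run(self, session_id: str, call: Callable[[], Awaitable[T]]) -> T:
        # Calls for the same session queue on the same lock.
        async with self._locks[session_id]:
            return await call()


async def demo() -> List[str]:
    """Two concurrent calls on session S1 execute back to back."""
    serializer = SessionSerializer()
    events: List[str] = []

    async def tool_call(name: str) -> None:
        events.append(f"start {name}")
        await asyncio.sleep(0.01)
        events.append(f"end {name}")

    await asyncio.gather(
        serializer.run("S1", lambda: tool_call("a")),
        serializer.run("S1", lambda: tool_call("b")),
    )
    return events
```

Locks are never removed in this sketch, so a real implementation would drop a session's lock when the session is cleaned up.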

Performance Considerations

  • In-memory cache for the hot path (target: < 1 ms lookup)
  • Redis fallback for cross-worker lookups (target: < 5 ms)
  • TTL prevents unbounded memory growth
  • No affinity tracking when disabled (zero overhead)

🔗 Related Issues

  • Design document: todo/session-affinity.md
  • mcpgateway/services/mcp_session_pool.py - Session pool
  • mcpgateway/cache/session_registry.py - Session registry
  • llmchat_router Redis worker affinity pattern

Labels

  • MUST (P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe)
  • enhancement (New feature or request)
  • wxo (wxo integration)
