[PERFORMANCE]: Database Retry Mechanism for High-Concurrency Resilience #1853

@crivetimihai

Summary

Add retry logic with exponential backoff to database session acquisition (get_db(), fresh_db_session()) to enable automatic recovery from transient connection failures under high load. Currently, the application handles connection errors gracefully (via ResilientSession rollback) but fails requests immediately without retry, preventing recovery under sustained load.

Problem Statement

To ensure resilience under sustained high-concurrency workloads, the platform needs automatic recovery mechanisms for transient database connection failures. Current behavior analysis shows:

  1. Pool exhaustion scenarios can trigger query_wait_timeout errors under heavy load
  2. ResilientSession correctly rolls back on connection errors but does not retry
  3. No backpressure mechanism - failed requests return errors immediately without retry
  4. Pool contamination risk - once connections start failing, new acquisitions may also fail
  5. No circuit breaker for database - circuit breakers exist for MCP sessions and tools, but not for DB operations

Without retry and circuit breaker patterns for database operations, the system requires manual intervention or load reduction to recover from transient connection issues.

Current State Analysis

What Exists

| Component | Status | Location |
| --- | --- | --- |
| ResilientHttpClient | Full retry with backoff | mcpgateway/utils/retry_manager.py |
| ResilientSession | Rollback only, no retry | mcpgateway/db.py:305-445 |
| get_db() | No retry logic | mcpgateway/db.py:5270-5307 |
| fresh_db_session() | No retry logic | mcpgateway/db.py:5362-5404 |
| DB startup retry | Exponential backoff | mcpgateway/utils/db_isready.py |
| MCP Session Pool Circuit Breaker | Fully implemented | mcpgateway/services/mcp_session_pool.py:373-431 |
| Tool Circuit Breaker Plugin | Fully implemented | plugins/circuit_breaker/circuit_breaker.py |

Configuration Status

| Setting | Default | Current Usage |
| --- | --- | --- |
| db_max_retries | 30 | Startup only (wait_for_db_ready()) |
| db_retry_interval_ms | 2000 | Startup only (wait_for_db_ready()) |
| db_max_backoff_seconds | 30 | Startup only (wait_for_db_ready()) |
| mcp_session_pool_circuit_breaker_threshold | 5 | Runtime (MCP sessions) |
| mcp_session_pool_circuit_breaker_reset | 60.0 | Runtime (MCP sessions) |

Gap Analysis

The existing db_max_retries, db_retry_interval_ms, and db_max_backoff_seconds settings are only used at application startup in wait_for_db_ready(). These same settings should be reused for runtime retry logic.

Proposed Solution

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         DATABASE RETRY ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐                                                           │
│  │   Request    │                                                           │
│  └──────┬───────┘                                                           │
│         │                                                                   │
│         ▼                                                                   │
│  ┌──────────────────┐     ┌─────────────────┐                               │
│  │  Circuit Breaker │────▶│  Fast Fail 503  │  (if circuit open)           │
│  └──────┬───────────┘     └─────────────────┘                               │
│         │ (circuit closed)                                                  │
│         ▼                                                                   │
│  ┌──────────────────┐                                                       │
│  │   get_db()       │  ← Add retry wrapper                                  │
│  └──────┬───────────┘                                                       │
│         │                                                                   │
│         ▼                                                                   │
│  ┌──────────────────┐     ┌─────────────────┐                               │
│  │  SessionLocal()  │────▶│  Pool Timeout?  │                               │
│  └──────────────────┘     └────────┬────────┘                               │
│                                    │                                        │
│                    ┌───────────────┴───────────────┐                        │
│                    │                               │                        │
│                    ▼                               ▼                        │
│            ┌──────────────┐               ┌──────────────┐                  │
│            │   Success    │               │ Retry Logic  │                  │
│            │  (continue)  │               │  (backoff)   │                  │
│            └──────────────┘               └──────┬───────┘                  │
│                                                  │                          │
│                                    ┌─────────────┴─────────────┐            │
│                                    │                           │            │
│                                    ▼                           ▼            │
│                           ┌──────────────┐           ┌──────────────┐       │
│                           │  Retry OK    │           │ Max Retries  │       │
│                           │  (continue)  │           │ (open circuit│       │
│                           └──────────────┘           │  → 503)      │       │
│                                                      └──────────────┘       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
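
The flow in the diagram can be read as the control-flow sketch below. Everything in it is a hypothetical stand-in (PoolTimeout, ServiceUnavailable, acquire_with_circuit, and the state dict are not names from the codebase). It only illustrates the order of checks and leaves out the backoff delays.

```python
from typing import Callable, Dict


class PoolTimeout(Exception):
    """Hypothetical stand-in for a transient pool/connection error."""


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 response."""


def acquire_with_circuit(
    open_session: Callable[[], object],
    state: Dict[str, bool],
    max_retries: int = 3,
) -> object:
    """Follow the diagram: circuit check first, then acquisition with retries."""
    if state.get("open", False):
        # Circuit open: fail fast without touching the pool.
        raise ServiceUnavailable("database circuit open")

    for _attempt in range(1, max_retries + 1):
        try:
            return open_session()  # Success: continue with the request.
        except PoolTimeout:
            continue  # A real implementation would back off before retrying.

    # Max retries reached: open the circuit and surface a 503.
    state["open"] = True
    raise ServiceUnavailable("database unavailable after retries")
```

Once state["open"] is set, later calls fail immediately until something resets it. The reset timing (HALF_OPEN) is deliberately not modelled here.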

Implementation Components

  1. Database Retry Decorator (mcpgateway/utils/db_retry.py) - NEW FILE

    • with_db_retry() decorator for synchronous session acquisition
    • Reuse existing db_max_retries, db_retry_interval_ms, db_max_backoff_seconds config
    • Exponential backoff with ±25% jitter (matching db_isready.py pattern)
    • Retriable error detection (pool timeout, query_wait_timeout, OperationalError, etc.)
  2. Database Circuit Breaker - NEW (follow existing patterns)

    • Reuse patterns from mcp_session_pool.py circuit breaker implementation
    • Three states: CLOSED (normal), OPEN (fast-fail), HALF_OPEN (testing recovery)
    • Integrate with /health endpoint for observability
  3. Updated get_db() and fresh_db_session() with retry wrapper
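
A minimal sketch of what the with_db_retry() decorator from item 1 might look like. The signature, the TransientDBError stand-in, and the injectable sleep parameter are assumptions for illustration. A real implementation would presumably read the db_* settings and match actual driver/SQLAlchemy exception types instead.

```python
import functools
import random
import time
from typing import Callable, Tuple, Type


class TransientDBError(Exception):
    """Hypothetical stand-in for retriable errors such as a pool timeout."""


def with_db_retry(
    max_retries: int = 30,
    interval_s: float = 2.0,
    max_backoff_s: float = 30.0,
    retriable: Tuple[Type[BaseException], ...] = (TransientDBError,),
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry the wrapped callable on retriable errors with jittered backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retriable:
                    if attempt == max_retries:
                        raise  # Out of attempts: propagate the last error.
                    # Same backoff shape as the db_isready.py pattern.
                    delay = min(interval_s * (2 ** (attempt - 1)), max_backoff_s)
                    jitter = delay * 0.25 * (2 * random.random() - 1)  # ±25%
                    sleep(delay + jitter)

        return wrapper

    return decorator
```

Non-retriable exceptions pass straight through, so only errors listed in retriable trigger a delay. If get_db() is a generator-style dependency, the decorator would likely wrap the session-creation call inside it rather than the generator itself.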

Configuration Updates

# Existing settings - extend usage to runtime (no changes to defaults)
db_max_retries: int = 30           # Already exists, enable at runtime
db_retry_interval_ms: int = 2000   # Already exists, enable at runtime
db_max_backoff_seconds: int = 30   # Already exists, enable at runtime

# Database Circuit Breaker Configuration (NEW - matches MCP session pool pattern)
db_circuit_enabled: bool = True
db_circuit_failure_threshold: int = 5   # Matches mcp_session_pool default
db_circuit_reset_seconds: float = 60.0  # Matches mcp_session_pool default

Environment Variables

# Existing (extend to runtime)
DB_MAX_RETRIES=30
DB_RETRY_INTERVAL_MS=2000
DB_MAX_BACKOFF_SECONDS=30

# New circuit breaker settings
DB_CIRCUIT_ENABLED=true
DB_CIRCUIT_FAILURE_THRESHOLD=5
DB_CIRCUIT_RESET_SECONDS=60.0

Files to Create

| File | Purpose |
| --- | --- |
| mcpgateway/utils/db_retry.py | Retry decorator with exponential backoff (reference db_isready.py pattern) |
| tests/unit/mcpgateway/utils/test_db_retry.py | Unit tests for retry logic |
| tests/integration/test_db_recovery.py | Integration tests for recovery scenarios |

Files to Modify

| File | Changes |
| --- | --- |
| mcpgateway/db.py | Add retry logic to get_db(), fresh_db_session(), add circuit breaker |
| mcpgateway/config.py | Add db_circuit_* settings |
| mcpgateway/admin.py | Expose circuit breaker status in pool stats |
| .env.example | Document new circuit breaker settings |
| docker-compose.yml | Add default values for new settings |
| charts/mcp-stack/values.yaml | Add Helm values for circuit breaker |
| docs/docs/manage/configuration.md | Document retry and circuit breaker behavior |

Implementation Notes

Retry Logic Pattern (from db_isready.py)

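# Exponential backoff with jitter - existing pattern to reuse
# interval and max_backoff must share a unit (seconds), so db_retry_interval_ms
# has to be divided by 1000 before use; attempt is 1-based
import random

delay = min(interval * (2 ** (attempt - 1)), max_backoff)
jitter = delay * 0.25 * (2 * random.random() - 1)  # ±25% of delay
actual_delay = delay + jitter  # can exceed max_backoff by up to 25%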
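
To make the formula concrete, the sketch below lists the un-jittered delays produced by the current defaults. backoff_schedule is a hypothetical helper, and it assumes no sleep follows the final attempt, so 30 attempts give 29 delays.

```python
from typing import List


def backoff_schedule(max_retries: int, interval_ms: int, max_backoff_s: float) -> List[float]:
    """Un-jittered delay (seconds) before each retry; none after the last attempt."""
    interval_s = interval_ms / 1000.0
    return [
        min(interval_s * (2 ** (attempt - 1)), float(max_backoff_s))
        for attempt in range(1, max_retries)
    ]


schedule = backoff_schedule(max_retries=30, interval_ms=2000, max_backoff_s=30)
print(schedule[:6])   # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
print(sum(schedule))  # 780.0
```

Under these assumptions a single request could wait up to about 13 minutes (780 s before jitter) if every attempt fails, which may be worth weighing against typical request timeouts when the startup defaults are reused at runtime.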

Circuit Breaker Pattern (from mcp_session_pool.py)

# Existing pattern to adapt for database operations
def _is_circuit_open(self, key: str) -> bool:
    if key in self._circuit_open_until:
        if time.time() < self._circuit_open_until[key]:
            return True
        del self._circuit_open_until[key]
        self._failures[key] = 0
    return False

def _record_failure(self, key: str) -> None:
    self._failures[key] += 1
    if self._failures[key] >= self._circuit_breaker_threshold:
        self._circuit_open_until[key] = time.time() + self._circuit_breaker_reset
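
The existing pattern above tracks only open/closed per key. A database variant with the three states named in the proposal might look like the sketch below. DBCircuitBreaker, CircuitState, and the injectable clock are hypothetical names, the class is not thread-safe, and HALF_OPEN lets every request through rather than a single probe. It also uses time.monotonic where the existing code uses time.time().

```python
import time
from enum import Enum
from typing import Callable, Optional


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fast-fail
    HALF_OPEN = "half_open"  # testing recovery


class DBCircuitBreaker:
    """Hypothetical single-key breaker for database session acquisition."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._state = CircuitState.CLOSED

    @property
    def state(self) -> CircuitState:
        # OPEN lapses into HALF_OPEN once the reset window has passed.
        if self._state is CircuitState.OPEN and self._clock() - self._opened_at >= self._reset:
            self._state = CircuitState.HALF_OPEN
        return self._state

    def allow_request(self) -> bool:
        return self.state is not CircuitState.OPEN

    def record_success(self) -> None:
        self._failures = 0
        self._state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe in HALF_OPEN re-opens immediately.
        if self.state is CircuitState.HALF_OPEN or self._failures >= self._threshold:
            self._state = CircuitState.OPEN
            self._opened_at = self._clock()
```

A caller would check allow_request() before acquiring a session and report the outcome with record_success() or record_failure(). The state property could also feed the proposed health/pool-stats output.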

Acceptance Criteria

  • with_db_retry decorator implemented using existing backoff pattern from db_isready.py
  • Database circuit breaker implemented following mcp_session_pool.py pattern
  • get_db() uses retry logic with circuit breaker
  • fresh_db_session() uses retry logic with circuit breaker
  • Existing db_max_retries, db_retry_interval_ms, db_max_backoff_seconds now used at runtime
  • Circuit breaker state exposed in /admin/pool-stats response
  • New db_circuit_* configuration documented in .env.example
  • Configuration added to docker-compose.yml
  • Configuration added to Helm chart values
  • Documentation updated
  • Unit tests pass
  • Integration tests pass
  • Load test validates recovery from transient pool exhaustion

References

  • Existing retry implementation: mcpgateway/utils/db_isready.py
  • Existing HTTP retry: mcpgateway/utils/retry_manager.py
  • MCP session circuit breaker: mcpgateway/services/mcp_session_pool.py:373-431
  • Tool circuit breaker plugin: plugins/circuit_breaker/circuit_breaker.py
  • ResilientSession (rollback-only): mcpgateway/db.py:305-445

Metadata

Labels

  • SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release)
  • database
  • enhancement (New feature or request)
  • performance (Performance related items)