[PERFORMANCE]: Database Retry Mechanism for High-Concurrency Resilience #1853
Summary
Add retry logic with exponential backoff to database session acquisition (get_db(), fresh_db_session()) to enable automatic recovery from transient connection failures under high load. Currently, the application handles connection errors gracefully (via ResilientSession rollback) but fails requests immediately without retry, preventing recovery under sustained load.
Problem Statement
To ensure resilience under sustained high-concurrency workloads, the platform needs automatic recovery mechanisms for transient database connection failures. Current behavior analysis shows:
- Pool exhaustion scenarios can trigger `query_wait_timeout` errors under heavy load
- `ResilientSession` correctly rolls back on connection errors but does not retry
- No backpressure mechanism - failed requests return errors immediately without retry
- Pool contamination risk - once connections start failing, new acquisitions may also fail
- No circuit breaker for database - circuit breakers exist for MCP sessions and tools, but not for DB operations
Without retry and circuit breaker patterns for database operations, the system requires manual intervention or load reduction to recover from transient connection issues.
Current State Analysis
What Exists
| Component | Status | Location |
|---|---|---|
| `ResilientHttpClient` | Full retry with backoff | mcpgateway/utils/retry_manager.py |
| `ResilientSession` | Rollback only, no retry | mcpgateway/db.py:305-445 |
| `get_db()` | No retry logic | mcpgateway/db.py:5270-5307 |
| `fresh_db_session()` | No retry logic | mcpgateway/db.py:5362-5404 |
| DB startup retry | Exponential backoff | mcpgateway/utils/db_isready.py |
| MCP Session Pool Circuit Breaker | Fully implemented | mcpgateway/services/mcp_session_pool.py:373-431 |
| Tool Circuit Breaker Plugin | Fully implemented | plugins/circuit_breaker/circuit_breaker.py |
Configuration Status
| Setting | Default | Current Usage |
|---|---|---|
| `db_max_retries` | 30 | Startup only (wait_for_db_ready()) |
| `db_retry_interval_ms` | 2000 | Startup only (wait_for_db_ready()) |
| `db_max_backoff_seconds` | 30 | Startup only (wait_for_db_ready()) |
| `mcp_session_pool_circuit_breaker_threshold` | 5 | Runtime (MCP sessions) |
| `mcp_session_pool_circuit_breaker_reset` | 60.0 | Runtime (MCP sessions) |
Gap Analysis
The existing db_max_retries, db_retry_interval_ms, and db_max_backoff_seconds settings are only used at application startup in wait_for_db_ready(). These same settings should be reused for runtime retry logic.
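To make the reuse concrete, here is a small sketch of the delay schedule those three defaults would produce at runtime. It assumes the backoff formula quoted under Implementation Notes; `backoff_delay` and its parameter names are hypothetical and are not taken from `db_isready.py`.

```python
import random


def backoff_delay(attempt, interval_s=2.0, max_backoff_s=30.0, jitter=0.25, rng=random.random):
    """Delay in seconds before retry number `attempt` (1-based).

    Hypothetical helper: interval_s mirrors db_retry_interval_ms / 1000 and
    max_backoff_s mirrors db_max_backoff_seconds.
    """
    base = min(interval_s * (2 ** (attempt - 1)), max_backoff_s)
    # rng() in [0, 1) maps to a jitter factor in [-jitter, +jitter)
    return base + base * jitter * (2 * rng() - 1)


# Jitter-free schedule for db_max_retries = 30 (rng pinned to 0.5 cancels the jitter term)
schedule = [backoff_delay(n, rng=lambda: 0.5) for n in range(1, 31)]
print(schedule[:6])   # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
print(sum(schedule))  # 810.0
```

Under these assumptions the delay reaches the 30-second cap on the fifth attempt, and exhausting all 30 attempts adds up to roughly 810 seconds of waiting before jitter. That figure is only arithmetic on the listed defaults, not a measured result.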
Proposed Solution
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│                         DATABASE RETRY ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐                                                           │
│  │   Request    │                                                           │
│  └──────┬───────┘                                                           │
│         │                                                                   │
│         ▼                                                                   │
│  ┌──────────────────┐     ┌─────────────────┐                               │
│  │ Circuit Breaker  │────▶│  Fast Fail 503  │  (if circuit open)            │
│  └──────┬───────────┘     └─────────────────┘                               │
│         │ (circuit closed)                                                  │
│         ▼                                                                   │
│  ┌──────────────────┐                                                       │
│  │     get_db()     │  ← Add retry wrapper                                  │
│  └──────┬───────────┘                                                       │
│         │                                                                   │
│         ▼                                                                   │
│  ┌──────────────────┐     ┌─────────────────┐                               │
│  │  SessionLocal()  │────▶│  Pool Timeout?  │                               │
│  └──────────────────┘     └────────┬────────┘                               │
│                                    │                                        │
│                    ┌───────────────┴───────────────┐                        │
│                    │                               │                        │
│                    ▼                               ▼                        │
│             ┌──────────────┐                ┌──────────────┐                │
│             │   Success    │                │ Retry Logic  │                │
│             │  (continue)  │                │  (backoff)   │                │
│             └──────────────┘                └──────┬───────┘                │
│                                                    │                        │
│                                      ┌─────────────┴─────────────┐          │
│                                      │                           │          │
│                                      ▼                           ▼          │
│                               ┌──────────────┐            ┌──────────────┐  │
│                               │   Retry OK   │            │ Max Retries  │  │
│                               │  (continue)  │            │ (open circuit│  │
│                               └──────────────┘            │    → 503)    │  │
│                                                           └──────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
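The gate at the top of the diagram could be sketched as a small three-state breaker. This is a minimal illustration under stated assumptions, not the `mcp_session_pool.py` implementation: `DbCircuitBreaker`, `CircuitState`, and the injectable `clock` are hypothetical names, the defaults mirror the proposed `db_circuit_failure_threshold` and `db_circuit_reset_seconds`, and the sketch is not thread-safe.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fast-fail, caller should return 503
    HALF_OPEN = "half_open"  # reset window elapsed, probing recovery


class DbCircuitBreaker:
    """Hypothetical breaker; names and defaults are assumptions for illustration."""

    def __init__(self, failure_threshold=5, reset_seconds=60.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit has not tripped

    @property
    def state(self):
        if self._opened_at is None:
            return CircuitState.CLOSED
        if self._clock() - self._opened_at >= self._reset:
            return CircuitState.HALF_OPEN
        return CircuitState.OPEN

    def allow_request(self):
        # Simplification: every request is allowed through while HALF_OPEN
        return self.state is not CircuitState.OPEN

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        if self.state is CircuitState.HALF_OPEN:
            self._opened_at = self._clock()  # probe failed, reopen the circuit
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


# Demo with a fake clock so the reset window can be stepped manually
now = [0.0]
breaker = DbCircuitBreaker(failure_threshold=2, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.state.name)  # OPEN
now[0] = 61.0
print(breaker.state.name)  # HALF_OPEN
breaker.record_success()
print(breaker.state.name)  # CLOSED
```

A production version would likely need a lock around the counters and a limit on how many probes pass while half-open; both are left out here to keep the state transitions visible.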
Implementation Components
- Database Retry Decorator (`mcpgateway/utils/db_retry.py`) - NEW FILE
  - `with_db_retry()` decorator for synchronous session acquisition
  - Reuse existing `db_max_retries`, `db_retry_interval_ms`, `db_max_backoff_seconds` config
  - Exponential backoff with ±25% jitter (matching `db_isready.py` pattern)
  - Retriable error detection (pool timeout, query_wait_timeout, OperationalError, etc.)
- Database Circuit Breaker - NEW (follow existing patterns)
  - Reuse patterns from `mcp_session_pool.py` circuit breaker implementation
  - Three states: CLOSED (normal), OPEN (fast-fail), HALF_OPEN (testing recovery)
  - Integrate with `/health` endpoint for observability
- Updated `get_db()` and `fresh_db_session()` with retry wrapper
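As a rough sketch of what the retry decorator might look like, the block below wraps a plain callable and retries it with capped exponential backoff and ±25% jitter. Everything in it is an assumption for illustration: the signature of `with_db_retry`, the `max_attempts` parameter (standing in for `db_max_retries`), the injectable `sleep`, and `TransientDbError`, which stands in for whatever retriable errors the real implementation detects. How the wrapper would be applied inside `get_db()` is not specified in this issue.

```python
import functools
import random
import time


class TransientDbError(Exception):
    """Hypothetical stand-in for retriable errors such as pool timeouts."""


def with_db_retry(max_attempts=3, interval_s=2.0, max_backoff_s=30.0,
                  retriable=(TransientDbError,), sleep=time.sleep):
    """Retry the wrapped callable on retriable errors with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retriable:
                    if attempt == max_attempts:
                        raise  # retries exhausted, surface the last error
                    delay = min(interval_s * (2 ** (attempt - 1)), max_backoff_s)
                    delay += delay * 0.25 * (2 * random.random() - 1)  # ±25% jitter
                    sleep(delay)
        return wrapper
    return decorator


# Demo: fail twice, then succeed; delays are recorded instead of slept
calls = {"n": 0}
delays = []


@with_db_retry(max_attempts=4, sleep=delays.append)
def acquire_session():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientDbError("pool timeout")
    return "session"


result = acquire_session()
print(result, calls["n"], len(delays))  # session 3 2
```

Non-retriable exceptions propagate immediately because only the types listed in `retriable` are caught. A circuit breaker would typically record a failure at the point where the final error is re-raised.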
Configuration Updates
# Existing settings - extend usage to runtime (no changes to defaults)
db_max_retries: int = 30 # Already exists, enable at runtime
db_retry_interval_ms: int = 2000 # Already exists, enable at runtime
db_max_backoff_seconds: int = 30 # Already exists, enable at runtime
# Database Circuit Breaker Configuration (NEW - matches MCP session pool pattern)
db_circuit_enabled: bool = True
db_circuit_failure_threshold: int = 5 # Matches mcp_session_pool default
db_circuit_reset_seconds: float = 60.0 # Matches mcp_session_pool default
Environment Variables
# Existing (extend to runtime)
DB_MAX_RETRIES=30
DB_RETRY_INTERVAL_MS=2000
DB_MAX_BACKOFF_SECONDS=30
# New circuit breaker settings
DB_CIRCUIT_ENABLED=true
DB_CIRCUIT_FAILURE_THRESHOLD=5
DB_CIRCUIT_RESET_SECONDS=60.0
Files to Create
| File | Purpose |
|---|---|
| `mcpgateway/utils/db_retry.py` | Retry decorator with exponential backoff (reference db_isready.py pattern) |
| `tests/unit/mcpgateway/utils/test_db_retry.py` | Unit tests for retry logic |
| `tests/integration/test_db_recovery.py` | Integration tests for recovery scenarios |
Files to Modify
| File | Changes |
|---|---|
| `mcpgateway/db.py` | Add retry logic to get_db(), fresh_db_session(), add circuit breaker |
| `mcpgateway/config.py` | Add db_circuit_* settings |
| `mcpgateway/admin.py` | Expose circuit breaker status in pool stats |
| `.env.example` | Document new circuit breaker settings |
| `docker-compose.yml` | Add default values for new settings |
| `charts/mcp-stack/values.yaml` | Add Helm values for circuit breaker |
| `docs/docs/manage/configuration.md` | Document retry and circuit breaker behavior |
Implementation Notes
Retry Logic Pattern (from db_isready.py)
# Exponential backoff with jitter - existing pattern to reuse
delay = min(interval * (2 ** (attempt - 1)), max_backoff)
jitter = delay * 0.25 * (2 * random.random() - 1) # ±25%
actual_delay = delay + jitter
Circuit Breaker Pattern (from mcp_session_pool.py)
# Existing pattern to adapt for database operations
def _is_circuit_open(self, key: str) -> bool:
    if key in self._circuit_open_until:
        if time.time() < self._circuit_open_until[key]:
            return True
        del self._circuit_open_until[key]
        self._failures[key] = 0
    return False

def _record_failure(self, key: str) -> None:
    self._failures[key] += 1
    if self._failures[key] >= self._circuit_breaker_threshold:
        self._circuit_open_until[key] = time.time() + self._circuit_breaker_reset
Acceptance Criteria
- `with_db_retry` decorator implemented using existing backoff pattern from `db_isready.py`
- Database circuit breaker implemented following `mcp_session_pool.py` pattern
- `get_db()` uses retry logic with circuit breaker
- `fresh_db_session()` uses retry logic with circuit breaker
- Existing `db_max_retries`, `db_retry_interval_ms`, `db_max_backoff_seconds` now used at runtime
- Circuit breaker state exposed in `/admin/pool-stats` response
- New `db_circuit_*` configuration documented in `.env.example`
- Configuration added to `docker-compose.yml`
- Configuration added to Helm chart values
- Documentation updated
- Unit tests pass
- Integration tests pass
- Load test validates recovery from transient pool exhaustion
References
- Existing retry implementation: `mcpgateway/utils/db_isready.py`
- Existing HTTP retry: `mcpgateway/utils/retry_manager.py`
- MCP session circuit breaker: `mcpgateway/services/mcp_session_pool.py:373-431`
- Tool circuit breaker plugin: `plugins/circuit_breaker/circuit_breaker.py`
- ResilientSession (rollback-only): `mcpgateway/db.py:305-445`