[PERFORMANCE]: Connection Pool Health Monitoring and Readiness Integration #1856
Labels: SHOULD, P2 (important but not vital; high-value items that are not crucial for the immediate release), database, enhancement, performance
Description
Summary
Add comprehensive connection pool health monitoring with Prometheus metrics and integrate pool health into Kubernetes readiness probes. This enables early warning before pool exhaustion and automatic traffic shedding from struggling replicas.
Problem Statement
Under sustained high-concurrency load, connection pool exhaustion can occur without warning. Current behavior:
- **No visibility into pool state** - no metrics expose pool utilization until failure
- **Readiness probes ignore pool health** - replicas with exhausted pools continue receiving traffic
- **No early warning** - alerts can only fire after errors occur, not before
- **All replicas fail together** - the load balancer can't route away from struggling instances
This results in cascade failures where all replicas become degraded simultaneously instead of graceful degradation where healthy replicas absorb traffic from unhealthy ones.
Current State
Existing Health Endpoints
| Endpoint | Purpose | Pool Awareness |
|---|---|---|
| `/health` | Liveness check | ❌ No |
| `/ready` | Readiness check | ❌ No |
| `/metrics` | Prometheus metrics | ❌ No pool metrics |
Current /ready Implementation
```python
# mcpgateway/main.py
@app.get("/ready")
async def ready():
    # Only checks if the app is running, not pool health
    return {"status": "ready"}
```

Proposed Solution
1. Connection Pool Metrics
Expose SQLAlchemy pool state via Prometheus:
```python
# mcpgateway/metrics/pool_metrics.py
from prometheus_client import Counter, Gauge

# Pool utilization gauges
db_pool_size = Gauge(
    "db_pool_size",
    "Total size of the database connection pool",
)
db_pool_checked_out = Gauge(
    "db_pool_checked_out",
    "Number of connections currently checked out from the pool",
)
db_pool_overflow = Gauge(
    "db_pool_overflow",
    "Number of overflow connections currently in use",
)
db_pool_checked_in = Gauge(
    "db_pool_checked_in",
    "Number of connections available in the pool",
)
db_pool_utilization = Gauge(
    "db_pool_utilization",
    "Pool utilization ratio (0.0-1.0)",
)

# Pool event counters
db_pool_timeout_total = Counter(
    "db_pool_timeout_total",
    "Total number of pool timeout errors",
)
db_pool_checkout_total = Counter(
    "db_pool_checkout_total",
    "Total connection checkouts from the pool",
)
db_pool_invalidated_total = Counter(
    "db_pool_invalidated_total",
    "Total connections invalidated",
)


def update_pool_metrics(engine):
    """Update pool metrics from a SQLAlchemy engine."""
    pool = engine.pool
    size = pool.size()
    checked_out = pool.checkedout()
    overflow = pool.overflow()
    checked_in = pool.checkedin()

    db_pool_size.set(size)
    db_pool_checked_out.set(checked_out)
    db_pool_overflow.set(overflow)
    db_pool_checked_in.set(checked_in)

    # Utilization = checked_out / (size + max_overflow).
    # Note: _max_overflow is a private QueuePool attribute; SQLAlchemy
    # exposes no public accessor for it.
    max_capacity = size + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    db_pool_utilization.set(utilization)
```

2. Pool Event Listeners
Capture pool events for detailed observability:
```python
# mcpgateway/db.py - Add event listeners
from sqlalchemy import event

from mcpgateway.metrics.pool_metrics import (
    db_pool_checkout_total,
    db_pool_invalidated_total,
    update_pool_metrics,
)


@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    """Track connection checkouts."""
    db_pool_checkout_total.inc()
    update_pool_metrics(engine)


@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_record):
    """Track connection returns."""
    update_pool_metrics(engine)


@event.listens_for(engine, "invalidate")
def on_invalidate(dbapi_conn, connection_record, exception):
    """Track connection invalidations."""
    db_pool_invalidated_total.inc()
    update_pool_metrics(engine)
```

3. Readiness Probe with Pool Health
```python
# mcpgateway/main.py
from fastapi.responses import JSONResponse
from sqlalchemy import text

from mcpgateway.config import settings


@app.get("/ready")
async def ready():
    """Readiness check including connection pool health.

    Fails readiness when pool utilization exceeds the configured threshold,
    causing the load balancer to stop routing traffic to this replica.
    """
    pool = engine.pool

    # Calculate pool utilization
    checked_out = pool.checkedout()
    max_capacity = pool.size() + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    pool_healthy = utilization < settings.readiness_pool_threshold

    # Optional: test actual connectivity
    db_connectable = True
    if settings.readiness_check_db_connection:
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
        except Exception:
            db_connectable = False

    is_ready = pool_healthy and db_connectable
    status = {
        "status": "ready" if is_ready else "not_ready",
        "pool": {
            "utilization": round(utilization, 2),
            "checked_out": checked_out,
            "capacity": max_capacity,
            "healthy": pool_healthy,
        },
        "database": {
            "connectable": db_connectable,
        },
    }
    if not is_ready:
        return JSONResponse(status_code=503, content=status)
    return status
```

4. Enhanced Health Endpoint
```python
@app.get("/health")
async def health():
    """Liveness check with optional pool details."""
    pool = engine.pool
    checked_out = pool.checkedout()
    max_capacity = pool.size() + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    return {
        "status": "healthy",
        "pool": {
            "size": pool.size(),
            "checked_out": checked_out,
            "overflow": pool.overflow(),
            "checked_in": pool.checkedin(),
            "utilization": round(utilization, 2),
            "max_overflow": pool._max_overflow,
        },
    }
```

5. Configuration
```python
# mcpgateway/config.py

# Pool health monitoring
pool_metrics_enabled: bool = Field(
    default=True,
    description="Enable Prometheus metrics for the connection pool",
)
pool_metrics_interval: float = Field(
    default=5.0,
    description="Interval in seconds between pool metric updates",
)

# Readiness probe configuration
readiness_pool_threshold: float = Field(
    default=0.85,
    ge=0.0,
    le=1.0,
    description="Pool utilization threshold for readiness (0.85 = fail at 85%)",
)
readiness_check_db_connection: bool = Field(
    default=False,
    description="Perform an actual DB query in the readiness check (adds latency)",
)
```

6. Environment Variables
```bash
# Pool Metrics
POOL_METRICS_ENABLED=true
POOL_METRICS_INTERVAL=5.0

# Readiness Probe
READINESS_POOL_THRESHOLD=0.85
READINESS_CHECK_DB_CONNECTION=false
```

Prometheus Alerting Rules
```yaml
# alerts/pool_alerts.yaml
groups:
  - name: database_pool
    rules:
      # Early warning - pool getting busy
      - alert: DatabasePoolHighUtilization
        expr: db_pool_utilization > 0.7
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database pool utilization above 70%"
          description: "Pool utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Critical - pool near exhaustion
      - alert: DatabasePoolCritical
        expr: db_pool_utilization > 0.9
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database pool near exhaustion"
          description: "Pool utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Pool timeout errors occurring
      - alert: DatabasePoolTimeouts
        expr: rate(db_pool_timeout_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database pool timeout errors"
          description: "Pool timeouts occurring at {{ $value }}/s on {{ $labels.instance }}"

      # High connection invalidation rate
      - alert: DatabaseConnectionInvalidations
        expr: rate(db_pool_invalidated_total[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High database connection invalidation rate"
          description: "Connections being invalidated at {{ $value }}/s on {{ $labels.instance }}"
```

Grafana Dashboard Queries
```promql
# Pool utilization over time
db_pool_utilization

# Available connections
db_pool_checked_in

# Checkout rate
rate(db_pool_checkout_total[1m])

# Timeout rate
rate(db_pool_timeout_total[5m])

# Checked-out connections vs base size plus in-use overflow
# (all series share the same instance label, so no vector matching
# modifiers are needed; db_pool_utilization already reports usage
# against full capacity)
db_pool_checked_out / (db_pool_size + db_pool_overflow)
```
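The timeout-rate query assumes `db_pool_timeout_total` actually gets incremented, but nothing in the proposal's code does so yet. One option (an assumption on my part, not part of the proposed code) is to wrap session acquisition and count `sqlalchemy.exc.TimeoutError` before re-raising. A standalone sketch, with stubs standing in for the Prometheus counter and the SQLAlchemy exception:

```python
class StubCounter:
    """Minimal stand-in for prometheus_client.Counter in this sketch."""
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


# In the real module this would be the db_pool_timeout_total Counter.
db_pool_timeout_total = StubCounter()


class PoolTimeoutError(Exception):
    """Stand-in for sqlalchemy.exc.TimeoutError."""


def checked_out_session(acquire):
    """Acquire a session via `acquire`, counting pool timeouts before re-raising."""
    try:
        return acquire()
    except PoolTimeoutError:
        db_pool_timeout_total.inc()
        raise
```

In the gateway, `acquire` would be the SessionLocal factory (or `engine.connect`), and the except clause would catch `sqlalchemy.exc.TimeoutError`; the stubs exist only so the sketch runs without dependencies.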
How This Helps Recovery
Before (Current Behavior)

```
High Load → All 3 replicas exhaust pools → All fail together → No recovery
        ↓
Load balancer keeps sending traffic to all replicas
        ↓
System stays degraded until load is removed
```

After (With Pool Health in Readiness)

```
High Load → Replica 1 pool at 85% → Fails readiness check
        ↓
Load balancer stops sending traffic to Replica 1
        ↓
Replica 1 pool drains, recovers
        ↓
Replica 1 passes readiness, rejoins
        ↓
Traffic redistributes across healthy replicas
```
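For the load balancer behavior described above to happen in Kubernetes, the deployment also needs a readinessProbe pointed at `/ready`. A minimal sketch (the port, timings, and thresholds are illustrative assumptions, not taken from the chart):

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 4444         # assumed container port
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3  # ~30s of sustained pool pressure before traffic is shed
```

The `failureThreshold` and `periodSeconds` together set how quickly a struggling replica is pulled out of rotation, and should be tuned against how fast the pool typically drains.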
Files to Create

| File | Purpose |
|---|---|
| `mcpgateway/metrics/pool_metrics.py` | Pool metric definitions and update logic |
| `tests/unit/mcpgateway/metrics/test_pool_metrics.py` | Unit tests |
| `infra/prometheus/alerts/pool_alerts.yaml` | Alerting rules |
| `infra/grafana/dashboards/pool_dashboard.json` | Grafana dashboard |
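The unit tests listed above could start from a fake pool object. The sketch below re-derives the utilization formula locally so it runs standalone; `FakePool` and `pool_utilization` are illustrative names, not the module's actual API:

```python
class FakePool:
    """Test double mimicking the SQLAlchemy QueuePool accessors used above."""

    def __init__(self, size, checked_out, overflow, max_overflow):
        self._size = size
        self._checked_out = checked_out
        self._overflow = overflow
        self._max_overflow = max_overflow

    def size(self):
        return self._size

    def checkedout(self):
        return self._checked_out

    def overflow(self):
        return self._overflow


def pool_utilization(pool):
    """Same formula as update_pool_metrics: checked_out / (size + max_overflow)."""
    max_capacity = pool.size() + pool._max_overflow
    return pool.checkedout() / max_capacity if max_capacity > 0 else 0.0


def test_utilization_counts_overflow_capacity():
    # 12 of a possible 20 connections (10 base + 10 overflow) are checked out.
    pool = FakePool(size=10, checked_out=12, overflow=2, max_overflow=10)
    assert pool_utilization(pool) == 0.6


def test_zero_capacity_pool_reports_zero():
    pool = FakePool(size=0, checked_out=0, overflow=0, max_overflow=0)
    assert pool_utilization(pool) == 0.0
```

The real tests would instead import `update_pool_metrics` and assert on the Prometheus gauges, but the fake-pool pattern carries over directly.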
Files to Modify

| File | Changes |
|---|---|
| `mcpgateway/db.py` | Add pool event listeners |
| `mcpgateway/main.py` | Update /health and /ready endpoints |
| `mcpgateway/config.py` | Add pool monitoring settings |
| `.env.example` | Document new environment variables |
| `docker-compose.yml` | Add default configuration |
| `charts/mcp-stack/values.yaml` | Add Helm values |
| `docs/docs/manage/configuration.md` | Document pool monitoring |
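For the `charts/mcp-stack/values.yaml` change, the addition might look like the following (the key structure is an assumption for illustration; the chart's actual layout may differ):

```yaml
# charts/mcp-stack/values.yaml (illustrative structure)
mcpgateway:
  env:
    POOL_METRICS_ENABLED: "true"
    POOL_METRICS_INTERVAL: "5.0"
    READINESS_POOL_THRESHOLD: "0.85"
    READINESS_CHECK_DB_CONNECTION: "false"
```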
Acceptance Criteria
- Pool metrics exposed at `/metrics` endpoint
- Pool event listeners capture checkout/checkin/invalidate events
- `/ready` fails when pool utilization exceeds threshold
- `/health` includes pool status details
- Configuration via environment variables
- Prometheus alerting rules provided
- Grafana dashboard provided
- Unit tests for metric collection
- Integration test: verify readiness fails under pool pressure
- Documentation updated
- Load test verifies traffic sheds to healthy replicas
Performance Considerations
- Pool metric collection is lightweight (reads pool state, no DB queries)
- Event listeners add negligible overhead
- Readiness check with `READINESS_CHECK_DB_CONNECTION=false` has no DB overhead
- Metrics update interval is configurable to balance freshness against overhead
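The configurable `POOL_METRICS_INTERVAL` implies a periodic updater running alongside the event listeners. One way to structure it is sketched below; the update callable is injected so the sketch runs standalone (in the gateway it would be something like `functools.partial(update_pool_metrics, engine)`):

```python
import asyncio


async def metrics_loop(update, interval: float, stop: asyncio.Event):
    """Call update() every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        update()
        try:
            # Sleep for the interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def demo():
    calls = []
    stop = asyncio.Event()
    task = asyncio.create_task(metrics_loop(lambda: calls.append(1), 0.01, stop))
    await asyncio.sleep(0.05)
    stop.set()   # triggers a clean exit instead of task.cancel()
    await task
    return len(calls)
```

Waiting on a stop event rather than a bare `asyncio.sleep` lets application shutdown end the loop promptly instead of waiting out the interval.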