[PERFORMANCE]: Connection Pool Health Monitoring and Readiness Integration #1856

@crivetimihai

Description

Summary

Add comprehensive connection pool health monitoring with Prometheus metrics and integrate pool health into Kubernetes readiness probes. This enables early warning before pool exhaustion and automatic traffic shedding from struggling replicas.

Problem Statement

Under sustained high-concurrency load, connection pool exhaustion can occur without warning. Current behavior:

  1. No visibility into pool state - No metrics expose pool utilization until failure
  2. Readiness probes ignore pool health - Replicas with exhausted pools continue receiving traffic
  3. No early warning - Alerts can only fire after errors occur, not before
  4. All replicas fail together - Load balancer can't route away from struggling instances

This results in cascade failures where all replicas become degraded simultaneously instead of graceful degradation where healthy replicas absorb traffic from unhealthy ones.

Current State

Existing Health Endpoints

| Endpoint | Purpose | Pool Awareness |
| --- | --- | --- |
| `/health` | Liveness check | ❌ No |
| `/ready` | Readiness check | ❌ No |
| `/metrics` | Prometheus metrics | ❌ No pool metrics |

Current /ready Implementation

# mcpgateway/main.py
@app.get("/ready")
async def ready():
    # Only checks if app is running, not pool health
    return {"status": "ready"}

Proposed Solution

1. Connection Pool Metrics

Expose SQLAlchemy pool state via Prometheus:

# mcpgateway/metrics/pool_metrics.py

from prometheus_client import Gauge, Counter

# Pool utilization gauges
db_pool_size = Gauge(
    "db_pool_size",
    "Total size of the database connection pool"
)
db_pool_checked_out = Gauge(
    "db_pool_checked_out",
    "Number of connections currently checked out from pool"
)
db_pool_overflow = Gauge(
    "db_pool_overflow",
    "Number of overflow connections currently in use"
)
db_pool_checked_in = Gauge(
    "db_pool_checked_in",
    "Number of connections available in pool"
)
db_pool_utilization = Gauge(
    "db_pool_utilization",
    "Pool utilization ratio (0.0-1.0)"
)

# Pool event counters
db_pool_timeout_total = Counter(
    "db_pool_timeout_total",
    "Total number of pool timeout errors"
)
db_pool_checkout_total = Counter(
    "db_pool_checkout_total",
    "Total connection checkouts from pool"
)
db_pool_invalidated_total = Counter(
    "db_pool_invalidated_total",
    "Total connections invalidated"
)


def update_pool_metrics(engine):
    """Update pool metrics from SQLAlchemy engine."""
    pool = engine.pool
    
    size = pool.size()
    checked_out = pool.checkedout()
    overflow = pool.overflow()
    checked_in = pool.checkedin()
    
    db_pool_size.set(size)
    db_pool_checked_out.set(checked_out)
    db_pool_overflow.set(overflow)
    db_pool_checked_in.set(checked_in)
    
    # Utilization = checked_out / (size + max_overflow).
    # Note: _max_overflow is a private QueuePool attribute with no public
    # accessor, so verify it against the pinned SQLAlchemy version.
    max_capacity = size + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    db_pool_utilization.set(utilization)

2. Pool Event Listeners

Capture pool events for detailed observability:

# mcpgateway/db.py - Add event listeners

from sqlalchemy import event

from mcpgateway.metrics.pool_metrics import (
    db_pool_checkout_total,
    db_pool_invalidated_total,
    update_pool_metrics,
)

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    """Track connection checkouts."""
    db_pool_checkout_total.inc()
    update_pool_metrics(engine)

@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_record):
    """Track connection returns."""
    update_pool_metrics(engine)

@event.listens_for(engine, "invalidate")
def on_invalidate(dbapi_conn, connection_record, exception):
    """Track connection invalidations."""
    db_pool_invalidated_total.inc()
    update_pool_metrics(engine)
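
SQLAlchemy fires no pool event on a checkout timeout, so `db_pool_timeout_total` cannot be fed by the listeners above. One option (an assumption, not part of this proposal's existing code) is to count the error where sessions are acquired. `counted_session` is a hypothetical helper; in the gateway, `timeout_exc` would be `sqlalchemy.exc.TimeoutError`.

```python
# Sketch: count pool checkout timeouts at the session-acquisition boundary.
# `counted_session` is hypothetical; pass sqlalchemy.exc.TimeoutError as
# timeout_exc when wiring it into the real session factory.
from contextlib import contextmanager


@contextmanager
def counted_session(session_factory, timeout_counter, timeout_exc=TimeoutError):
    """Yield a session; bump timeout_counter if acquiring one times out."""
    try:
        session = session_factory()
    except timeout_exc:
        timeout_counter.inc()
        raise
    try:
        yield session
    finally:
        session.close()
```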

3. Readiness Probe with Pool Health

# mcpgateway/main.py

from fastapi.responses import JSONResponse
from sqlalchemy import text

from mcpgateway.config import settings
from mcpgateway.db import engine

@app.get("/ready")
async def ready():
    """Readiness check including connection pool health.
    
    Fails readiness when pool utilization exceeds threshold,
    causing load balancer to stop routing traffic to this replica.
    """
    pool = engine.pool
    
    # Calculate pool utilization
    checked_out = pool.checkedout()
    max_capacity = pool.size() + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    
    pool_healthy = utilization < settings.readiness_pool_threshold
    
    # Optional: Test actual connectivity
    db_connectable = True
    if settings.readiness_check_db_connection:
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
        except Exception:
            db_connectable = False
    
    is_ready = pool_healthy and db_connectable
    
    status = {
        "status": "ready" if is_ready else "not_ready",
        "pool": {
            "utilization": round(utilization, 2),
            "checked_out": checked_out,
            "capacity": max_capacity,
            "healthy": pool_healthy,
        },
        "database": {
            "connectable": db_connectable,
        }
    }
    
    if not is_ready:
        return JSONResponse(status_code=503, content=status)
    
    return status
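
For the 503 response to actually shed traffic, the Deployment must point its readiness probe at `/ready` while liveness stays on `/health` (so a busy pool sheds traffic rather than triggering a restart). The port and timings below are illustrative assumptions to be tuned per environment:

```yaml
# Illustrative Kubernetes probe wiring (values are assumptions, tune per env)
readinessProbe:
  httpGet:
    path: /ready
    port: 4444
  periodSeconds: 5
  failureThreshold: 2    # ~10s of sustained pool pressure before traffic is shed
livenessProbe:
  httpGet:
    path: /health        # reports "healthy" even under pool pressure, so no restart
    port: 4444
  periodSeconds: 10
```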

4. Enhanced Health Endpoint

@app.get("/health")
async def health():
    """Liveness check with optional pool details."""
    pool = engine.pool
    
    checked_out = pool.checkedout()
    max_capacity = pool.size() + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    
    return {
        "status": "healthy",
        "pool": {
            "size": pool.size(),
            "checked_out": checked_out,
            "overflow": pool.overflow(),
            "checked_in": pool.checkedin(),
            "utilization": round(utilization, 2),
            "max_overflow": pool._max_overflow,
        }
    }

5. Configuration

# mcpgateway/config.py

# Pool health monitoring
pool_metrics_enabled: bool = Field(
    default=True,
    description="Enable Prometheus metrics for connection pool"
)
pool_metrics_interval: float = Field(
    default=5.0,
    description="Interval in seconds to update pool metrics"
)

# Readiness probe configuration
readiness_pool_threshold: float = Field(
    default=0.85,
    ge=0.0,
    le=1.0,
    description="Pool utilization threshold for readiness (0.85 = fail at 85%)"
)
readiness_check_db_connection: bool = Field(
    default=False,
    description="Perform actual DB query in readiness check (adds latency)"
)

6. Environment Variables

# Pool Metrics
POOL_METRICS_ENABLED=true
POOL_METRICS_INTERVAL=5.0

# Readiness Probe
READINESS_POOL_THRESHOLD=0.85
READINESS_CHECK_DB_CONNECTION=false

Prometheus Alerting Rules

# alerts/pool_alerts.yaml

groups:
  - name: database_pool
    rules:
      # Early warning - pool getting busy
      - alert: DatabasePoolHighUtilization
        expr: db_pool_utilization > 0.7
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database pool utilization above 70%"
          description: "Pool utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Critical - pool near exhaustion
      - alert: DatabasePoolCritical
        expr: db_pool_utilization > 0.9
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database pool near exhaustion"
          description: "Pool utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Pool timeout errors occurring
      - alert: DatabasePoolTimeouts
        expr: rate(db_pool_timeout_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database pool timeout errors"
          description: "Pool timeouts occurring at {{ $value }}/s on {{ $labels.instance }}"

      # High connection invalidation rate
      - alert: DatabaseConnectionInvalidations
        expr: rate(db_pool_invalidated_total[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High database connection invalidation rate"
          description: "Connections being invalidated at {{ $value }}/s on {{ $labels.instance }}"

Grafana Dashboard Queries

# Pool utilization over time
db_pool_utilization

# Available connections
db_pool_checked_in

# Checkout rate
rate(db_pool_checkout_total[1m])

# Timeout rate
rate(db_pool_timeout_total[5m])

# Pool capacity vs usage (values above 1.0 mean overflow connections in use)
db_pool_checked_out / db_pool_size

How This Helps Recovery

Before (Current Behavior)

High Load → All 3 replicas exhaust pools → All fail together → No recovery
                                                    ↓
                                    Load balancer keeps sending to all
                                                    ↓
                                    System stays degraded until load removed

After (With Pool Health in Readiness)

High Load → Replica 1 pool at 85% → Fails readiness check
                    ↓
            Load balancer stops sending traffic to Replica 1
                    ↓
            Replica 1 pool drains, recovers
                    ↓
            Replica 1 passes readiness, rejoins
                    ↓
            Traffic redistributes across healthy replicas

Files to Create

| File | Purpose |
| --- | --- |
| `mcpgateway/metrics/pool_metrics.py` | Pool metric definitions and update logic |
| `tests/unit/mcpgateway/metrics/test_pool_metrics.py` | Unit tests |
| `infra/prometheus/alerts/pool_alerts.yaml` | Alerting rules |
| `infra/grafana/dashboards/pool_dashboard.json` | Grafana dashboard |

Files to Modify

| File | Changes |
| --- | --- |
| `mcpgateway/db.py` | Add pool event listeners |
| `mcpgateway/main.py` | Update /health and /ready endpoints |
| `mcpgateway/config.py` | Add pool monitoring settings |
| `.env.example` | Document new environment variables |
| `docker-compose.yml` | Add default configuration |
| `charts/mcp-stack/values.yaml` | Add Helm values |
| `docs/docs/manage/configuration.md` | Document pool monitoring |

Acceptance Criteria

  • Pool metrics exposed at /metrics endpoint
  • Pool event listeners capture checkout/checkin/invalidate events
  • /ready fails when pool utilization exceeds threshold
  • /health includes pool status details
  • Configuration via environment variables
  • Prometheus alerting rules provided
  • Grafana dashboard provided
  • Unit tests for metric collection
  • Integration test: verify readiness fails under pool pressure
  • Documentation updated
  • Load test verifies traffic sheds to healthy replicas
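
The readiness criterion can be unit-tested without a database by factoring the decision into a pure function. `pool_ready` below is a hypothetical helper mirroring the `/ready` math, not existing gateway code:

```python
# Sketch: the /ready decision as a pure, DB-free function for unit testing.
# `pool_ready` is a hypothetical helper mirroring the proposed readiness math.
def pool_ready(checked_out: int, size: int, max_overflow: int,
               threshold: float = 0.85) -> bool:
    """True when utilization is strictly below the readiness threshold."""
    capacity = size + max_overflow
    utilization = checked_out / capacity if capacity > 0 else 0.0
    return utilization < threshold


assert pool_ready(10, 10, 5)        # 10/15 ≈ 0.67 → ready
assert not pool_ready(13, 10, 5)    # 13/15 ≈ 0.87 → not ready
```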

Performance Considerations

  • Pool metric collection is lightweight (reads pool state, no DB queries)
  • Event listeners add negligible overhead
  • Readiness check with READINESS_CHECK_DB_CONNECTION=false has no DB overhead
  • Metrics update interval configurable to balance freshness vs overhead

Metadata

Labels: SHOULD · P2 (Important but not vital; high-value items that are not crucial for the immediate release) · database · enhancement (New feature or request) · performance (Performance related items)
