[PERFORMANCE]: Connection Pool Health Monitoring and Readiness Integration #1856
Labels: SHOULD, P2 (important but not vital; high-value items that are not crucial for the immediate release), database, enhancement, performance
Description
Summary
Add comprehensive connection pool health monitoring with Prometheus metrics and integrate pool health into Kubernetes readiness probes. This enables early warning before pool exhaustion and automatic traffic shedding from struggling replicas.
Problem Statement
Under sustained high-concurrency load, connection pool exhaustion can occur without warning. Current behavior:
- **No visibility into pool state** - no metrics expose pool utilization until failure
- **Readiness probes ignore pool health** - replicas with exhausted pools continue receiving traffic
- **No early warning** - alerts can only fire after errors occur, not before
- **All replicas fail together** - the load balancer can't route away from struggling instances
This results in cascade failures where all replicas become degraded simultaneously instead of graceful degradation where healthy replicas absorb traffic from unhealthy ones.
Current State
Existing Health Endpoints
| Endpoint | Purpose | Pool Awareness |
|---|---|---|
| `/health` | Liveness check | ❌ No |
| `/ready` | Readiness check | ❌ No |
| `/metrics` | Prometheus metrics | ❌ No pool metrics |
Current /ready Implementation
```python
# mcpgateway/main.py
@app.get("/ready")
async def ready():
    # Only checks if the app is running, not pool health
    return {"status": "ready"}
```

Proposed Solution
1. Connection Pool Metrics
Expose SQLAlchemy pool state via Prometheus:
```python
# mcpgateway/metrics/pool_metrics.py
from prometheus_client import Counter, Gauge

# Pool utilization gauges
db_pool_size = Gauge(
    "db_pool_size",
    "Total size of the database connection pool",
)
db_pool_checked_out = Gauge(
    "db_pool_checked_out",
    "Number of connections currently checked out from the pool",
)
db_pool_overflow = Gauge(
    "db_pool_overflow",
    "Number of overflow connections currently in use",
)
db_pool_checked_in = Gauge(
    "db_pool_checked_in",
    "Number of connections available in the pool",
)
db_pool_utilization = Gauge(
    "db_pool_utilization",
    "Pool utilization ratio (0.0-1.0)",
)

# Pool event counters
db_pool_timeout_total = Counter(
    "db_pool_timeout_total",
    "Total number of pool timeout errors",
)
db_pool_checkout_total = Counter(
    "db_pool_checkout_total",
    "Total connection checkouts from the pool",
)
db_pool_invalidated_total = Counter(
    "db_pool_invalidated_total",
    "Total connections invalidated",
)


def update_pool_metrics(engine):
    """Update pool metrics from a SQLAlchemy engine."""
    pool = engine.pool
    size = pool.size()
    checked_out = pool.checkedout()
    overflow = pool.overflow()
    checked_in = pool.checkedin()

    db_pool_size.set(size)
    db_pool_checked_out.set(checked_out)
    db_pool_overflow.set(overflow)
    db_pool_checked_in.set(checked_in)

    # Utilization = checked_out / (size + max_overflow).
    # Note: _max_overflow is a private QueuePool attribute; SQLAlchemy
    # exposes no public accessor for it.
    max_capacity = size + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    db_pool_utilization.set(utilization)
```

2. Pool Event Listeners
Capture pool events for detailed observability:
```python
# mcpgateway/db.py - Add event listeners
from sqlalchemy import event

from mcpgateway.metrics.pool_metrics import (
    db_pool_checkout_total,
    db_pool_invalidated_total,
    update_pool_metrics,
)


@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    """Track connection checkouts."""
    db_pool_checkout_total.inc()
    update_pool_metrics(engine)


@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_record):
    """Track connection returns."""
    update_pool_metrics(engine)


@event.listens_for(engine, "invalidate")
def on_invalidate(dbapi_conn, connection_record, exception):
    """Track connection invalidations."""
    db_pool_invalidated_total.inc()
    update_pool_metrics(engine)
```

3. Readiness Probe with Pool Health
```python
# mcpgateway/main.py
from fastapi.responses import JSONResponse
from sqlalchemy import text

from mcpgateway.config import settings


@app.get("/ready")
async def ready():
    """Readiness check including connection pool health.

    Fails readiness when pool utilization exceeds the configured threshold,
    causing the load balancer to stop routing traffic to this replica.
    """
    pool = engine.pool

    # Calculate pool utilization
    checked_out = pool.checkedout()
    max_capacity = pool.size() + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    pool_healthy = utilization < settings.readiness_pool_threshold

    # Optional: test actual connectivity
    db_connectable = True
    if settings.readiness_check_db_connection:
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
        except Exception:
            db_connectable = False

    is_ready = pool_healthy and db_connectable
    status = {
        "status": "ready" if is_ready else "not_ready",
        "pool": {
            "utilization": round(utilization, 2),
            "checked_out": checked_out,
            "capacity": max_capacity,
            "healthy": pool_healthy,
        },
        "database": {
            "connectable": db_connectable,
        },
    }
    if not is_ready:
        return JSONResponse(status_code=503, content=status)
    return status
```

4. Enhanced Health Endpoint
```python
@app.get("/health")
async def health():
    """Liveness check with optional pool details."""
    pool = engine.pool
    checked_out = pool.checkedout()
    max_capacity = pool.size() + pool._max_overflow
    utilization = checked_out / max_capacity if max_capacity > 0 else 0
    return {
        "status": "healthy",
        "pool": {
            "size": pool.size(),
            "checked_out": checked_out,
            "overflow": pool.overflow(),
            "checked_in": pool.checkedin(),
            "utilization": round(utilization, 2),
            "max_overflow": pool._max_overflow,
        },
    }
```

5. Configuration
```python
# mcpgateway/config.py

# Pool health monitoring
pool_metrics_enabled: bool = Field(
    default=True,
    description="Enable Prometheus metrics for the connection pool",
)
pool_metrics_interval: float = Field(
    default=5.0,
    description="Interval in seconds between pool metric updates",
)

# Readiness probe configuration
readiness_pool_threshold: float = Field(
    default=0.85,
    ge=0.0,
    le=1.0,
    description="Pool utilization threshold for readiness (0.85 = fail at 85%)",
)
readiness_check_db_connection: bool = Field(
    default=False,
    description="Perform an actual DB query in the readiness check (adds latency)",
)
```

6. Environment Variables
```bash
# Pool Metrics
POOL_METRICS_ENABLED=true
POOL_METRICS_INTERVAL=5.0

# Readiness Probe
READINESS_POOL_THRESHOLD=0.85
READINESS_CHECK_DB_CONNECTION=false
```

Prometheus Alerting Rules
```yaml
# alerts/pool_alerts.yaml
groups:
  - name: database_pool
    rules:
      # Early warning - pool getting busy
      - alert: DatabasePoolHighUtilization
        expr: db_pool_utilization > 0.7
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database pool utilization above 70%"
          description: "Pool utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Critical - pool near exhaustion
      - alert: DatabasePoolCritical
        expr: db_pool_utilization > 0.9
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database pool near exhaustion"
          description: "Pool utilization is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Pool timeout errors occurring
      - alert: DatabasePoolTimeouts
        expr: rate(db_pool_timeout_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database pool timeout errors"
          description: "Pool timeouts occurring at {{ $value }}/s on {{ $labels.instance }}"

      # High connection invalidation rate
      - alert: DatabaseConnectionInvalidations
        expr: rate(db_pool_invalidated_total[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High database connection invalidation rate"
          description: "Connections being invalidated at {{ $value }}/s on {{ $labels.instance }}"
```

Grafana Dashboard Queries
```promql
# Pool utilization over time
db_pool_utilization

# Available connections
db_pool_checked_in

# Checkout rate
rate(db_pool_checkout_total[1m])

# Timeout rate
rate(db_pool_timeout_total[5m])

# Checked-out connections vs base size plus in-use overflow
# (all series share the same instance label, so no vector matching
# modifiers are needed; db_pool_utilization already reports usage
# against full capacity)
db_pool_checked_out / (db_pool_size + db_pool_overflow)
```
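The timeout-rate query assumes `db_pool_timeout_total` actually gets incremented, but nothing in the proposal's code does so yet. One option (an assumption on my part, not part of the proposed code) is to wrap session acquisition and count `sqlalchemy.exc.TimeoutError` before re-raising. A standalone sketch, with stubs standing in for the Prometheus counter and the SQLAlchemy exception:

```python
class StubCounter:
    """Minimal stand-in for prometheus_client.Counter in this sketch."""
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


# In the real module this would be the db_pool_timeout_total Counter.
db_pool_timeout_total = StubCounter()


class PoolTimeoutError(Exception):
    """Stand-in for sqlalchemy.exc.TimeoutError."""


def checked_out_session(acquire):
    """Acquire a session via `acquire`, counting pool timeouts before re-raising."""
    try:
        return acquire()
    except PoolTimeoutError:
        db_pool_timeout_total.inc()
        raise
```

In the gateway, `acquire` would be the SessionLocal factory (or `engine.connect`), and the except clause would catch `sqlalchemy.exc.TimeoutError`; the stubs exist only so the sketch runs without dependencies.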
How This Helps Recovery
Before (Current Behavior)

```
High Load → All 3 replicas exhaust pools → All fail together → No recovery
        ↓
Load balancer keeps sending traffic to all replicas
        ↓
System stays degraded until load is removed
```

After (With Pool Health in Readiness)

```
High Load → Replica 1 pool at 85% → Fails readiness check
        ↓
Load balancer stops sending traffic to Replica 1
        ↓
Replica 1 pool drains, recovers
        ↓
Replica 1 passes readiness, rejoins
        ↓
Traffic redistributes across healthy replicas
```
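For the load balancer behavior described above to happen in Kubernetes, the deployment also needs a readinessProbe pointed at `/ready`. A minimal sketch (the port, timings, and thresholds are illustrative assumptions, not taken from the chart):

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 4444         # assumed container port
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3  # ~30s of sustained pool pressure before traffic is shed
```

The `failureThreshold` and `periodSeconds` together set how quickly a struggling replica is pulled out of rotation, and should be tuned against how fast the pool typically drains.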
Files to Create

| File | Purpose |
|---|---|
| `mcpgateway/metrics/pool_metrics.py` | Pool metric definitions and update logic |
| `tests/unit/mcpgateway/metrics/test_pool_metrics.py` | Unit tests |
| `infra/prometheus/alerts/pool_alerts.yaml` | Alerting rules |
| `infra/grafana/dashboards/pool_dashboard.json` | Grafana dashboard |
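The unit tests listed above could start from a fake pool object. The sketch below re-derives the utilization formula locally so it runs standalone; `FakePool` and `pool_utilization` are illustrative names, not the module's actual API:

```python
class FakePool:
    """Test double mimicking the SQLAlchemy QueuePool accessors used above."""

    def __init__(self, size, checked_out, overflow, max_overflow):
        self._size = size
        self._checked_out = checked_out
        self._overflow = overflow
        self._max_overflow = max_overflow

    def size(self):
        return self._size

    def checkedout(self):
        return self._checked_out

    def overflow(self):
        return self._overflow


def pool_utilization(pool):
    """Same formula as update_pool_metrics: checked_out / (size + max_overflow)."""
    max_capacity = pool.size() + pool._max_overflow
    return pool.checkedout() / max_capacity if max_capacity > 0 else 0.0


def test_utilization_counts_overflow_capacity():
    # 12 of a possible 20 connections (10 base + 10 overflow) are checked out.
    pool = FakePool(size=10, checked_out=12, overflow=2, max_overflow=10)
    assert pool_utilization(pool) == 0.6


def test_zero_capacity_pool_reports_zero():
    pool = FakePool(size=0, checked_out=0, overflow=0, max_overflow=0)
    assert pool_utilization(pool) == 0.0
```

The real tests would instead import `update_pool_metrics` and assert on the Prometheus gauges, but the fake-pool pattern carries over directly.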
Files to Modify

| File | Changes |
|---|---|
| `mcpgateway/db.py` | Add pool event listeners |
| `mcpgateway/main.py` | Update /health and /ready endpoints |
| `mcpgateway/config.py` | Add pool monitoring settings |
| `.env.example` | Document new environment variables |
| `docker-compose.yml` | Add default configuration |
| `charts/mcp-stack/values.yaml` | Add Helm values |
| `docs/docs/manage/configuration.md` | Document pool monitoring |
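For the `charts/mcp-stack/values.yaml` change, the addition might look like the following (the key structure is an assumption for illustration; the chart's actual layout may differ):

```yaml
# charts/mcp-stack/values.yaml (illustrative structure)
mcpgateway:
  env:
    POOL_METRICS_ENABLED: "true"
    POOL_METRICS_INTERVAL: "5.0"
    READINESS_POOL_THRESHOLD: "0.85"
    READINESS_CHECK_DB_CONNECTION: "false"
```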
Acceptance Criteria
- Pool metrics exposed at `/metrics` endpoint
- Pool event listeners capture checkout/checkin/invalidate events
- `/ready` fails when pool utilization exceeds threshold
- `/health` includes pool status details
- Configuration via environment variables
- Prometheus alerting rules provided
- Grafana dashboard provided
- Unit tests for metric collection
- Integration test: verify readiness fails under pool pressure
- Documentation updated
- Load test verifies traffic sheds to healthy replicas
Performance Considerations
- Pool metric collection is lightweight (reads pool state, no DB queries)
- Event listeners add negligible overhead
- Readiness check with `READINESS_CHECK_DB_CONNECTION=false` has no DB overhead
- Metrics update interval is configurable to balance freshness against overhead
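The configurable `POOL_METRICS_INTERVAL` implies a periodic updater running alongside the event listeners. One way to structure it is sketched below; the update callable is injected so the sketch runs standalone (in the gateway it would be something like `functools.partial(update_pool_metrics, engine)`):

```python
import asyncio


async def metrics_loop(update, interval: float, stop: asyncio.Event):
    """Call update() every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        update()
        try:
            # Sleep for the interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def demo():
    calls = []
    stop = asyncio.Event()
    task = asyncio.create_task(metrics_loop(lambda: calls.append(1), 0.01, stop))
    await asyncio.sleep(0.05)
    stop.set()   # triggers a clean exit instead of task.cancel()
    await task
    return len(calls)
```

Waiting on a stop event rather than a bare `asyncio.sleep` lets application shutdown end the loop promptly instead of waiting out the interval.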