[PERFORMANCE]: Async Database Logging to Prevent Feedback Loop Under Load #1857
Status: Open

Labels: SHOULD (P2: important but not vital; high-value items that are not crucial for the immediate release), database, enhancement (new feature or request), performance (performance related items)
Description
Summary
Implement asynchronous, buffered database logging to prevent error logging from worsening cascade failures. Currently, when STRUCTURED_LOGGING_DATABASE_ENABLED=true, each log entry triggers a synchronous database write that blocks until commit completes. Under high load with connection pool exhaustion, this creates a feedback loop where error logs compete for the same exhausted pool, amplifying the failure.
Problem Statement
During sustained high-concurrency load, connection pool exhaustion triggers cascading errors. The current logging architecture exacerbates this:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOGGING FEEDBACK LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. High Load → Pool Exhaustion │
│ │ │
│ ▼ │
│ 2. Request fails with OperationalError │
│ │ │
│ ▼ │
│ 3. Error logged via structured_logger.error() │
│ │ │
│ ▼ │
│ 4. LogRouter._persist_to_database() called │
│ │ │
│ ▼ │
│ 5. Attempts to acquire DB connection from SAME exhausted pool │
│ │ │
│ ├──► Success: Blocks thread waiting for commit │
│ │ (increases request latency, holds connection longer) │
│ │ │
│ └──► Failure: Another timeout error → triggers MORE logging │
│ │ │
│ └──► GOTO step 3 (recursive amplification) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
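The amplification in steps 3–5 can be simulated with a toy pool. Everything below (`TinyPool`, the timeout, the depth guard) is hypothetical and for illustration only; the real code path has no recursion limit, which is exactly the problem:

```python
import threading

class TinyPool:
    """Stand-in for a DB connection pool (hypothetical; illustrative only)."""
    def __init__(self, size: int):
        self._sem = threading.Semaphore(size)

    def acquire(self, timeout: float) -> bool:
        return self._sem.acquire(timeout=timeout)

    def release(self) -> None:
        self._sem.release()

log_attempts = 0

def log_error_sync(pool: TinyPool, depth: int = 0) -> None:
    """Synchronous logging path: each failure to get a connection
    triggers another error log, which needs a connection too (steps 3-5)."""
    global log_attempts
    log_attempts += 1
    if depth >= 3:                 # the real code has no such guard; the
        return                     # recursion is bounded here only to keep the demo finite
    if pool.acquire(timeout=0.01):
        pool.release()             # write succeeded (but blocked meanwhile)
    else:
        log_error_sync(pool, depth + 1)   # timeout error -> GOTO step 3

pool = TinyPool(size=0)            # exhausted pool
log_error_sync(pool)
print(log_attempts)                # 4: one failure amplified into four log writes
```

One failed request produces four attempted log writes against an already-exhausted pool, which is the recursive amplification the diagram describes.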
Current Implementation
Location: mcpgateway/services/structured_logger.py
```python
# LogRouter._persist_to_database() - Lines 199-316
def _persist_to_database(self, entry: Dict[str, Any], db: Optional[Session] = None) -> None:
    if db is None:
        db = SessionLocal()  # ← Acquires connection from app pool
    try:
        log_entry = StructuredLogEntry(...)
        db.add(log_entry)
        db.commit()  # ← SYNCHRONOUS BLOCKING WRITE
    except Exception as e:
        db.rollback()
```
Key Issues
| Issue | Impact |
|---|---|
| Synchronous `db.commit()` | Blocks calling thread until write completes |
| Uses same connection pool | Competes with application for exhausted connections |
| Per-entry writes | No batching; 1 DB round-trip per log entry |
| No backpressure | Logs queued indefinitely if DB is slow |
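The cost of per-entry commits is easy to demonstrate. A minimal sqlite3 sketch (illustrative only; the table and helpers are hypothetical, not the gateway's actual schema):

```python
import sqlite3

def write_per_entry(conn: sqlite3.Connection, entries: list) -> None:
    """One commit per log entry - mirrors the current synchronous path."""
    for msg in entries:
        conn.execute("INSERT INTO logs (message) VALUES (?)", (msg,))
        conn.commit()  # a blocking round-trip for every single entry

def write_batched(conn: sqlite3.Connection, entries: list, batch_size: int = 100) -> None:
    """One commit per batch - the proposed approach."""
    cur = conn.cursor()
    for i in range(0, len(entries), batch_size):
        cur.executemany(
            "INSERT INTO logs (message) VALUES (?)",
            [(m,) for m in entries[i:i + batch_size]],
        )
        conn.commit()  # one round-trip per batch_size entries

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (message TEXT)")
write_per_entry(conn, [f"entry-{i}" for i in range(50)])     # 50 commits
write_batched(conn, [f"entry-{i}" for i in range(250)])      # 3 commits
print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])  # 300
```

With a batch size of 100, 250 entries cost 3 commits instead of 250, and each commit holds the connection for a correspondingly shorter total time.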
Proposed Solution
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ ASYNC LOGGING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Application Thread Background Worker │
│ ───────────────── ───────────────── │
│ │
│ structured_logger.error() │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Enrich Entry │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ Non-blocking ┌──────────────────┐ │
│ │ Bounded Queue │ ──────────────────── │ Batch Writer │ │
│ │ (in-memory) │ put_nowait() │ (dedicated │ │
│ │ │ │ connection) │ │
│ │ [entry1] │ │ │ │
│ │ [entry2] │ │ - Batch inserts │ │
│ │ [entry3] │ │ - Retry logic │ │
│ │ ... │ │ - Backpressure │ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │ │
│ │ (if queue full) ▼ │
│ ▼ ┌──────────────────┐ │
│ ┌──────────────────┐ │ Database │ │
│ │ Drop + Metric │ │ (dedicated │ │
│ │ (graceful shed) │ │ connection) │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions
- Bounded Queue: Fixed-size queue (10K entries) prevents memory exhaustion
- Non-blocking Put: Application never waits for logging (<1µs)
- Graceful Shedding: Drop logs when queue full (with counter metric)
- Batch Writes: Reduce DB round-trips by batching 100 inserts
- Dedicated Connection: Separate 2-connection pool from application pool
- Backpressure Awareness: Retry with backoff if DB is struggling
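The design decisions above could be sketched as follows. This is a minimal stdlib sketch, not the proposed implementation: only the name `AsyncLogBuffer` comes from this issue; `flush_fn` and all internals are hypothetical, and the real flush would perform a batched INSERT over a dedicated connection pool.

```python
import queue
import threading
import time
from typing import Any, Callable, Dict, List

class AsyncLogBuffer:
    """Sketch: bounded queue, non-blocking put, background batch flusher,
    and drop counting when the buffer is full."""

    def __init__(self, flush_fn: Callable[[List[Dict[str, Any]]], None],
                 max_entries: int = 10_000, batch_size: int = 100,
                 flush_interval: float = 1.0):
        self._queue: queue.Queue = queue.Queue(maxsize=max_entries)
        self._flush_fn = flush_fn          # hypothetical: batched INSERT on a dedicated connection
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._stop = threading.Event()
        self.dropped_total = 0             # would back a Prometheus counter
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._worker.start()

    def put(self, entry: Dict[str, Any]) -> None:
        """Never blocks the application thread; sheds load when full."""
        try:
            self._queue.put_nowait(entry)
        except queue.Full:
            self.dropped_total += 1        # graceful shedding, app unaffected

    def _drain_batch(self) -> List[Dict[str, Any]]:
        batch: List[Dict[str, Any]] = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _run(self) -> None:
        while not self._stop.is_set():
            batch = self._drain_batch()
            if batch:
                self._flush_fn(batch)      # one DB round-trip per batch
            else:
                time.sleep(self._flush_interval)

    def stop(self) -> None:
        """Shutdown: stop the worker, then flush whatever remains."""
        self._stop.set()
        self._worker.join()
        while True:
            batch = self._drain_batch()
            if not batch:
                break
            self._flush_fn(batch)
```

Usage mirrors the current call site: `structured_logger` would call `buffer.put(entry)` instead of writing synchronously, and `start()`/`stop()` would hook into application startup and shutdown.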
Performance Comparison
| Metric | Before (Sync) | After (Async) |
|---|---|---|
| Latency added to error path | 10-100ms | <1µs |
| DB connections used | App pool | Dedicated (2) |
| Under pool exhaustion | Recursive failure | Graceful degradation |
| Log throughput | 100-500/s | 10,000+/s |
| Memory usage | Minimal | +10MB (buffer) |
Graceful Degradation Under Load
| Scenario | Behavior |
|---|---|
| Normal load | Logs buffered, flushed in batches every 1s |
| High log volume | Larger batches, more efficient writes |
| Buffer filling | Metric incremented, continues operating |
| Buffer full | New entries dropped, application unaffected |
| DB connection issues | Retry with backoff, entries held in buffer |
| DB completely down | Buffer fills, drops entries, app continues |
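The "retry with backoff" rows above imply a bounded retry loop in the flush worker. A sketch under stated assumptions (`flush_fn`, the attempt limit, and the delay constants are all hypothetical; the injectable `sleep` exists only to make the sketch testable):

```python
import time
from typing import Any, Callable, Dict, List

def flush_with_backoff(flush_fn: Callable[[List[Dict[str, Any]]], None],
                       batch: List[Dict[str, Any]],
                       max_attempts: int = 5,
                       base_delay: float = 0.1,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry a batch flush with exponential backoff; give up after
    max_attempts so the worker sheds the batch instead of blocking forever."""
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            flush_fn(batch)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False               # caller drops the batch and increments a metric
            sleep(delay)
            delay = min(delay * 2, max_delay)  # 0.1s, 0.2s, 0.4s, ... capped
    return False
```

Capping both the delay and the attempt count matters here: an unbounded retry loop would hold entries (and the dedicated connection) indefinitely when the DB is completely down, which is the exact failure mode the table above says must degrade gracefully.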
Configuration
```
# Use async buffered writes (recommended for production)
LOG_BUFFER_ASYNC_ENABLED=true

# Maximum log entries to buffer before dropping oldest
LOG_BUFFER_MAX_ENTRIES=10000

# Batch size for database writes (reduces round-trips)
LOG_BUFFER_BATCH_SIZE=100

# Maximum seconds between flushes
LOG_BUFFER_FLUSH_INTERVAL=1.0

# Dedicated connection pool size for log writes
LOG_BUFFER_DEDICATED_POOL_SIZE=2
```
Prometheus Metrics
```
log_buffer_size                     # Current entries in buffer
log_buffer_dropped_total            # Entries dropped due to full buffer
log_buffer_flushed_total            # Entries successfully flushed
log_buffer_flush_duration_seconds   # Flush latency histogram
log_buffer_flush_errors_total       # Flush errors
```
Files to Create
| File | Purpose |
|---|---|
| `mcpgateway/services/async_log_buffer.py` | Async buffered log writer |
| `tests/unit/mcpgateway/services/test_async_log_buffer.py` | Unit tests |
| `tests/integration/test_async_logging.py` | Integration tests |
Files to Modify
| File | Changes |
|---|---|
| `mcpgateway/services/structured_logger.py` | Use async buffer in `_persist_to_database()` |
| `mcpgateway/config.py` | Add buffer configuration settings |
| `mcpgateway/main.py` | Initialize/shutdown async buffer |
| `.env.example` | Document new environment variables |
| `docker-compose.yml` | Add default configuration |
| `docs/docs/manage/logging.md` | Update logging architecture docs |
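The `mcpgateway/config.py` change could read the environment variables documented above. A stdlib-only sketch (the real settings class presumably follows whatever mechanism `config.py` already uses; `LogBufferSettings` and `from_env` are names invented here):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class LogBufferSettings:
    """Sketch of the proposed settings, with the defaults from this issue."""
    async_enabled: bool = True
    max_entries: int = 10_000
    batch_size: int = 100
    flush_interval: float = 1.0
    dedicated_pool_size: int = 2

    @classmethod
    def from_env(cls) -> "LogBufferSettings":
        """Read the LOG_BUFFER_* variables, falling back to the defaults."""
        env = os.environ
        return cls(
            async_enabled=env.get("LOG_BUFFER_ASYNC_ENABLED", "true").lower() == "true",
            max_entries=int(env.get("LOG_BUFFER_MAX_ENTRIES", "10000")),
            batch_size=int(env.get("LOG_BUFFER_BATCH_SIZE", "100")),
            flush_interval=float(env.get("LOG_BUFFER_FLUSH_INTERVAL", "1.0")),
            dedicated_pool_size=int(env.get("LOG_BUFFER_DEDICATED_POOL_SIZE", "2")),
        )
```

Keeping every knob in one frozen settings object makes the buffer easy to construct at startup and easy to override in tests.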
Acceptance Criteria
- AsyncLogBuffer class implemented with bounded queue
- Non-blocking put() method (<1µs latency)
- Background flush thread with batch writes
- Dedicated connection pool (separate from app)
- Retry logic with exponential backoff
- Graceful shedding when buffer full
- Prometheus metrics for buffer health
- Configuration via environment variables
- Startup/shutdown lifecycle integration
- Unit tests with >90% coverage
- Integration test under load
- Documentation updated
- Load test confirms no feedback loop
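The graceful-shedding and non-blocking-put criteria can be pinned down with a unit test along these lines. This is self-contained and uses a plain bounded queue as a stand-in for the future AsyncLogBuffer (`put_or_drop` is a hypothetical helper capturing the contract the real buffer must keep):

```python
import queue

def put_or_drop(q: "queue.Queue", entry, dropped: list) -> bool:
    """Non-blocking put with shedding - the contract the real buffer must keep."""
    try:
        q.put_nowait(entry)
        return True
    except queue.Full:
        dropped.append(entry)   # real code would increment log_buffer_dropped_total
        return False

def test_graceful_shedding_when_full():
    q = queue.Queue(maxsize=3)
    dropped = []
    accepted = [put_or_drop(q, i, dropped) for i in range(5)]
    assert accepted == [True, True, True, False, False]  # app never blocks
    assert q.qsize() == 3          # buffer capped at its bound
    assert dropped == [3, 4]       # overflow shed, not queued

test_graceful_shedding_when_full()
```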
Related
- [PERFORMANCE]: Database Retry Mechanism for High-Concurrency Resilience #1853 - Database Retry Mechanism
- [PERFORMANCE]: Global Rate Limiting for Gateway Protection #1854 - Global Rate Limiting
- [PERFORMANCE]: Connection Pool Health Monitoring and Readiness Integration #1856 - Connection Pool Health Monitoring