[PERFORMANCE]: Async Database Logging to Prevent Feedback Loop Under Load #1857
Status: Open

Labels: SHOULD (P2: important but not vital; high-value items that are not crucial for the immediate release), database, enhancement (new feature or request), performance (performance related items)
Description
Summary
Implement asynchronous, buffered database logging to prevent error logging from worsening cascade failures. Currently, when STRUCTURED_LOGGING_DATABASE_ENABLED=true, each log entry triggers a synchronous database write that blocks until commit completes. Under high load with connection pool exhaustion, this creates a feedback loop where error logs compete for the same exhausted pool, amplifying the failure.
Problem Statement
During sustained high-concurrency load, connection pool exhaustion triggers cascading errors. The current logging architecture exacerbates this:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOGGING FEEDBACK LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. High Load → Pool Exhaustion │
│ │ │
│ ▼ │
│ 2. Request fails with OperationalError │
│ │ │
│ ▼ │
│ 3. Error logged via structured_logger.error() │
│ │ │
│ ▼ │
│ 4. LogRouter._persist_to_database() called │
│ │ │
│ ▼ │
│ 5. Attempts to acquire DB connection from SAME exhausted pool │
│ │ │
│ ├──► Success: Blocks thread waiting for commit │
│ │ (increases request latency, holds connection longer) │
│ │ │
│ └──► Failure: Another timeout error → triggers MORE logging │
│ │ │
│ └──► GOTO step 3 (recursive amplification) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
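The amplification in steps 3–5 can be simulated with a toy pool. Everything below (`TinyPool`, the timeout, the depth guard) is hypothetical and for illustration only; the real code path has no recursion limit, which is exactly the problem:

```python
import threading

class TinyPool:
    """Stand-in for a DB connection pool (hypothetical; illustrative only)."""
    def __init__(self, size: int):
        self._sem = threading.Semaphore(size)

    def acquire(self, timeout: float) -> bool:
        return self._sem.acquire(timeout=timeout)

    def release(self) -> None:
        self._sem.release()

log_attempts = 0

def log_error_sync(pool: TinyPool, depth: int = 0) -> None:
    """Synchronous logging path: each failure to get a connection
    triggers another error log, which needs a connection too (steps 3-5)."""
    global log_attempts
    log_attempts += 1
    if depth >= 3:                 # the real code has no such guard; the
        return                     # recursion is bounded here only to keep the demo finite
    if pool.acquire(timeout=0.01):
        pool.release()             # write succeeded (but blocked meanwhile)
    else:
        log_error_sync(pool, depth + 1)   # timeout error -> GOTO step 3

pool = TinyPool(size=0)            # exhausted pool
log_error_sync(pool)
print(log_attempts)                # 4: one failure amplified into four log writes
```

One failed request produces four attempted log writes against an already-exhausted pool, which is the recursive amplification the diagram describes.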
Current Implementation
Location: mcpgateway/services/structured_logger.py
```python
# LogRouter._persist_to_database() - Lines 199-316
def _persist_to_database(self, entry: Dict[str, Any], db: Optional[Session] = None) -> None:
    if db is None:
        db = SessionLocal()  # ← Acquires connection from app pool
    try:
        log_entry = StructuredLogEntry(...)
        db.add(log_entry)
        db.commit()  # ← SYNCHRONOUS BLOCKING WRITE
    except Exception as e:
        db.rollback()
```
Key Issues
| Issue | Impact |
|---|---|
| Synchronous `db.commit()` | Blocks calling thread until write completes |
| Uses same connection pool | Competes with application for exhausted connections |
| Per-entry writes | No batching; 1 DB round-trip per log entry |
| No backpressure | Logs queued indefinitely if DB is slow |
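The cost of per-entry commits is easy to demonstrate. A minimal sqlite3 sketch (illustrative only; the table and helpers are hypothetical, not the gateway's actual schema):

```python
import sqlite3

def write_per_entry(conn: sqlite3.Connection, entries: list) -> None:
    """One commit per log entry - mirrors the current synchronous path."""
    for msg in entries:
        conn.execute("INSERT INTO logs (message) VALUES (?)", (msg,))
        conn.commit()  # a blocking round-trip for every single entry

def write_batched(conn: sqlite3.Connection, entries: list, batch_size: int = 100) -> None:
    """One commit per batch - the proposed approach."""
    cur = conn.cursor()
    for i in range(0, len(entries), batch_size):
        cur.executemany(
            "INSERT INTO logs (message) VALUES (?)",
            [(m,) for m in entries[i:i + batch_size]],
        )
        conn.commit()  # one round-trip per batch_size entries

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (message TEXT)")
write_per_entry(conn, [f"entry-{i}" for i in range(50)])     # 50 commits
write_batched(conn, [f"entry-{i}" for i in range(250)])      # 3 commits
print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])  # 300
```

With a batch size of 100, 250 entries cost 3 commits instead of 250, and each commit holds the connection for a correspondingly shorter total time.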
Proposed Solution
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ ASYNC LOGGING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Application Thread Background Worker │
│ ───────────────── ───────────────── │
│ │
│ structured_logger.error() │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Enrich Entry │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ Non-blocking ┌──────────────────┐ │
│ │ Bounded Queue │ ──────────────────── │ Batch Writer │ │
│ │ (in-memory) │ put_nowait() │ (dedicated │ │
│ │ │ │ connection) │ │
│ │ [entry1] │ │ │ │
│ │ [entry2] │ │ - Batch inserts │ │
│ │ [entry3] │ │ - Retry logic │ │
│ │ ... │ │ - Backpressure │ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │ │
│ │ (if queue full) ▼ │
│ ▼ ┌──────────────────┐ │
│ ┌──────────────────┐ │ Database │ │
│ │ Drop + Metric │ │ (dedicated │ │
│ │ (graceful shed) │ │ connection) │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions
- Bounded Queue: Fixed-size queue (10K entries) prevents memory exhaustion
- Non-blocking Put: Application never waits for logging (<1µs)
- Graceful Shedding: Drop logs when queue full (with counter metric)
- Batch Writes: Reduce DB round-trips by batching 100 inserts
- Dedicated Connection: Separate 2-connection pool from application pool
- Backpressure Awareness: Retry with backoff if DB is struggling
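The design decisions above could be sketched as follows. This is a minimal stdlib sketch, not the proposed implementation: only the name `AsyncLogBuffer` comes from this issue; `flush_fn` and all internals are hypothetical, and the real flush would perform a batched INSERT over a dedicated connection pool.

```python
import queue
import threading
import time
from typing import Any, Callable, Dict, List

class AsyncLogBuffer:
    """Sketch: bounded queue, non-blocking put, background batch flusher,
    and drop counting when the buffer is full."""

    def __init__(self, flush_fn: Callable[[List[Dict[str, Any]]], None],
                 max_entries: int = 10_000, batch_size: int = 100,
                 flush_interval: float = 1.0):
        self._queue: queue.Queue = queue.Queue(maxsize=max_entries)
        self._flush_fn = flush_fn          # hypothetical: batched INSERT on a dedicated connection
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._stop = threading.Event()
        self.dropped_total = 0             # would back a Prometheus counter
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._worker.start()

    def put(self, entry: Dict[str, Any]) -> None:
        """Never blocks the application thread; sheds load when full."""
        try:
            self._queue.put_nowait(entry)
        except queue.Full:
            self.dropped_total += 1        # graceful shedding, app unaffected

    def _drain_batch(self) -> List[Dict[str, Any]]:
        batch: List[Dict[str, Any]] = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _run(self) -> None:
        while not self._stop.is_set():
            batch = self._drain_batch()
            if batch:
                self._flush_fn(batch)      # one DB round-trip per batch
            else:
                time.sleep(self._flush_interval)

    def stop(self) -> None:
        """Shutdown: stop the worker, then flush whatever remains."""
        self._stop.set()
        self._worker.join()
        while True:
            batch = self._drain_batch()
            if not batch:
                break
            self._flush_fn(batch)
```

Usage mirrors the current call site: `structured_logger` would call `buffer.put(entry)` instead of writing synchronously, and `start()`/`stop()` would hook into application startup and shutdown.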
Performance Comparison
| Metric | Before (Sync) | After (Async) |
|---|---|---|
| Latency added to error path | 10-100ms | <1µs |
| DB connections used | App pool | Dedicated (2) |
| Under pool exhaustion | Recursive failure | Graceful degradation |
| Log throughput | 100-500/s | 10,000+/s |
| Memory usage | Minimal | +10MB (buffer) |
Graceful Degradation Under Load
| Scenario | Behavior |
|---|---|
| Normal load | Logs buffered, flushed in batches every 1s |
| High log volume | Larger batches, more efficient writes |
| Buffer filling | Metric incremented, continues operating |
| Buffer full | New entries dropped, application unaffected |
| DB connection issues | Retry with backoff, entries held in buffer |
| DB completely down | Buffer fills, drops entries, app continues |
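The "retry with backoff" rows above imply a bounded retry loop in the flush worker. A sketch under stated assumptions (`flush_fn`, the attempt limit, and the delay constants are all hypothetical; the injectable `sleep` exists only to make the sketch testable):

```python
import time
from typing import Any, Callable, Dict, List

def flush_with_backoff(flush_fn: Callable[[List[Dict[str, Any]]], None],
                       batch: List[Dict[str, Any]],
                       max_attempts: int = 5,
                       base_delay: float = 0.1,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry a batch flush with exponential backoff; give up after
    max_attempts so the worker sheds the batch instead of blocking forever."""
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            flush_fn(batch)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False               # caller drops the batch and increments a metric
            sleep(delay)
            delay = min(delay * 2, max_delay)  # 0.1s, 0.2s, 0.4s, ... capped
    return False
```

Capping both the delay and the attempt count matters here: an unbounded retry loop would hold entries (and the dedicated connection) indefinitely when the DB is completely down, which is the exact failure mode the table above says must degrade gracefully.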
Configuration
```
# Use async buffered writes (recommended for production)
LOG_BUFFER_ASYNC_ENABLED=true

# Maximum log entries to buffer before dropping oldest
LOG_BUFFER_MAX_ENTRIES=10000

# Batch size for database writes (reduces round-trips)
LOG_BUFFER_BATCH_SIZE=100

# Maximum seconds between flushes
LOG_BUFFER_FLUSH_INTERVAL=1.0

# Dedicated connection pool size for log writes
LOG_BUFFER_DEDICATED_POOL_SIZE=2
```
Prometheus Metrics
```
log_buffer_size                     # Current entries in buffer
log_buffer_dropped_total            # Entries dropped due to full buffer
log_buffer_flushed_total            # Entries successfully flushed
log_buffer_flush_duration_seconds   # Flush latency histogram
log_buffer_flush_errors_total       # Flush errors
```
Files to Create
| File | Purpose |
|---|---|
| `mcpgateway/services/async_log_buffer.py` | Async buffered log writer |
| `tests/unit/mcpgateway/services/test_async_log_buffer.py` | Unit tests |
| `tests/integration/test_async_logging.py` | Integration tests |
Files to Modify
| File | Changes |
|---|---|
| `mcpgateway/services/structured_logger.py` | Use async buffer in `_persist_to_database()` |
| `mcpgateway/config.py` | Add buffer configuration settings |
| `mcpgateway/main.py` | Initialize/shutdown async buffer |
| `.env.example` | Document new environment variables |
| `docker-compose.yml` | Add default configuration |
| `docs/docs/manage/logging.md` | Update logging architecture docs |
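The `mcpgateway/config.py` change could read the environment variables documented above. A stdlib-only sketch (the real settings class presumably follows whatever mechanism `config.py` already uses; `LogBufferSettings` and `from_env` are names invented here):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class LogBufferSettings:
    """Sketch of the proposed settings, with the defaults from this issue."""
    async_enabled: bool = True
    max_entries: int = 10_000
    batch_size: int = 100
    flush_interval: float = 1.0
    dedicated_pool_size: int = 2

    @classmethod
    def from_env(cls) -> "LogBufferSettings":
        """Read the LOG_BUFFER_* variables, falling back to the defaults."""
        env = os.environ
        return cls(
            async_enabled=env.get("LOG_BUFFER_ASYNC_ENABLED", "true").lower() == "true",
            max_entries=int(env.get("LOG_BUFFER_MAX_ENTRIES", "10000")),
            batch_size=int(env.get("LOG_BUFFER_BATCH_SIZE", "100")),
            flush_interval=float(env.get("LOG_BUFFER_FLUSH_INTERVAL", "1.0")),
            dedicated_pool_size=int(env.get("LOG_BUFFER_DEDICATED_POOL_SIZE", "2")),
        )
```

Keeping every knob in one frozen settings object makes the buffer easy to construct at startup and easy to override in tests.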
Acceptance Criteria
- AsyncLogBuffer class implemented with bounded queue
- Non-blocking put() method (<1µs latency)
- Background flush thread with batch writes
- Dedicated connection pool (separate from app)
- Retry logic with exponential backoff
- Graceful shedding when buffer full
- Prometheus metrics for buffer health
- Configuration via environment variables
- Startup/shutdown lifecycle integration
- Unit tests with >90% coverage
- Integration test under load
- Documentation updated
- Load test confirms no feedback loop
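The graceful-shedding and non-blocking-put criteria can be pinned down with a unit test along these lines. This is self-contained and uses a plain bounded queue as a stand-in for the future AsyncLogBuffer (`put_or_drop` is a hypothetical helper capturing the contract the real buffer must keep):

```python
import queue

def put_or_drop(q: "queue.Queue", entry, dropped: list) -> bool:
    """Non-blocking put with shedding - the contract the real buffer must keep."""
    try:
        q.put_nowait(entry)
        return True
    except queue.Full:
        dropped.append(entry)   # real code would increment log_buffer_dropped_total
        return False

def test_graceful_shedding_when_full():
    q = queue.Queue(maxsize=3)
    dropped = []
    accepted = [put_or_drop(q, i, dropped) for i in range(5)]
    assert accepted == [True, True, True, False, False]  # app never blocks
    assert q.qsize() == 3          # buffer capped at its bound
    assert dropped == [3, 4]       # overflow shed, not queued

test_graceful_shedding_when_full()
```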
Related
- [PERFORMANCE]: Database Retry Mechanism for High-Concurrency Resilience #1853 - Database Retry Mechanism
- [PERFORMANCE]: Global Rate Limiting for Gateway Protection #1854 - Global Rate Limiting
- [PERFORMANCE]: Connection Pool Health Monitoring and Readiness Integration #1856 - Connection Pool Health Monitoring