[PERFORMANCE]: Async Database Logging to Prevent Feedback Loop Under Load #1857

@crivetimihai

Description

Summary

Implement asynchronous, buffered database logging to prevent error logging from worsening cascade failures. Currently, when STRUCTURED_LOGGING_DATABASE_ENABLED=true, each log entry triggers a synchronous database write that blocks until commit completes. Under high load with connection pool exhaustion, this creates a feedback loop where error logs compete for the same exhausted pool, amplifying the failure.

Problem Statement

During sustained high-concurrency load, connection pool exhaustion triggers cascading errors. The current logging architecture exacerbates this:

┌─────────────────────────────────────────────────────────────────────────────┐
│                     LOGGING FEEDBACK LOOP                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. High Load → Pool Exhaustion                                            │
│          │                                                                  │
│          ▼                                                                  │
│   2. Request fails with OperationalError                                    │
│          │                                                                  │
│          ▼                                                                  │
│   3. Error logged via structured_logger.error()                             │
│          │                                                                  │
│          ▼                                                                  │
│   4. LogRouter._persist_to_database() called                                │
│          │                                                                  │
│          ▼                                                                  │
│   5. Attempts to acquire DB connection from SAME exhausted pool             │
│          │                                                                  │
│          ├──► Success: Blocks thread waiting for commit                     │
│          │    (increases request latency, holds connection longer)          │
│          │                                                                  │
│          └──► Failure: Another timeout error → triggers MORE logging        │
│                        │                                                    │
│                        └──► GOTO step 3 (recursive amplification)           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Current Implementation

Location: mcpgateway/services/structured_logger.py

```python
# LogRouter._persist_to_database() - Lines 199-316 (abridged)
def _persist_to_database(self, entry: Dict[str, Any], db: Optional[Session] = None) -> None:
    if db is None:
        db = SessionLocal()  # ← Acquires a connection from the shared app pool
    try:
        log_entry = StructuredLogEntry(...)
        db.add(log_entry)
        db.commit()  # ← SYNCHRONOUS BLOCKING WRITE
    except Exception:
        db.rollback()
```

Key Issues

| Issue | Impact |
|---|---|
| Synchronous `db.commit()` | Blocks the calling thread until the write completes |
| Uses the same connection pool | Competes with the application for exhausted connections |
| Per-entry writes | No batching; one DB round-trip per log entry |
| No backpressure | Logs queue indefinitely if the DB is slow |

Proposed Solution

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                     ASYNC LOGGING ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Application Thread                    Background Worker                   │
│   ─────────────────                    ─────────────────                    │
│                                                                             │
│   structured_logger.error()                                                 │
│          │                                                                  │
│          ▼                                                                  │
│   ┌──────────────────┐                                                      │
│   │  Enrich Entry    │                                                      │
│   └────────┬─────────┘                                                      │
│            │                                                                │
│            ▼                                                                │
│   ┌──────────────────┐     Non-blocking     ┌──────────────────┐            │
│   │  Bounded Queue   │ ──────────────────── │  Batch Writer    │            │
│   │  (in-memory)     │      put_nowait()    │  (dedicated      │            │
│   │                  │                      │   connection)    │            │
│   │  [entry1]        │                      │                  │            │
│   │  [entry2]        │                      │  - Batch inserts │            │
│   │  [entry3]        │                      │  - Retry logic   │            │
│   │  ...             │                      │  - Backpressure  │            │
│   └──────────────────┘                      └────────┬─────────┘            │
│            │                                         │                      │
│            │ (if queue full)                         ▼                      │
│            ▼                                  ┌──────────────────┐          │
│   ┌──────────────────┐                       │    Database      │           │
│   │  Drop + Metric   │                       │  (dedicated      │           │
│   │  (graceful shed) │                       │   connection)    │           │
│   └──────────────────┘                       └──────────────────┘           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Design Decisions

  1. Bounded Queue: Fixed-size queue (10K entries) prevents memory exhaustion
  2. Non-blocking Put: Application never waits for logging (<1µs)
  3. Graceful Shedding: Drop logs when queue full (with counter metric)
  4. Batch Writes: Reduce DB round-trips by batching 100 inserts
  5. Dedicated Connection: Separate 2-connection pool from application pool
  6. Backpressure Awareness: Retry with backoff if DB is struggling
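
Decisions 1–3 can be sketched as follows. This is a minimal illustration, not the final API: the class name `AsyncLogBuffer` matches the acceptance criteria below, but the method and attribute names are assumptions.

```python
import queue
import threading
from typing import Any, Dict


class AsyncLogBuffer:
    """Sketch of a bounded, non-blocking log buffer with graceful shedding."""

    def __init__(self, max_entries: int = 10_000) -> None:
        # Bounded queue: a fixed capacity prevents memory exhaustion.
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=max_entries)
        self.dropped_total = 0  # would back the log_buffer_dropped_total metric
        self._lock = threading.Lock()

    def put(self, entry: Dict[str, Any]) -> bool:
        """Enqueue without blocking; shed (drop) the entry if the buffer is full."""
        try:
            self._queue.put_nowait(entry)  # never waits, regardless of DB health
            return True
        except queue.Full:
            with self._lock:
                self.dropped_total += 1  # count the drop instead of blocking
            return False
```

The application thread only ever pays for a `put_nowait()` call; the database never appears on the request path.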

Performance Comparison

| Metric | Before (Sync) | After (Async) |
|---|---|---|
| Latency added to error path | 10–100 ms | <1 µs |
| DB connections used | App pool | Dedicated (2) |
| Under pool exhaustion | Recursive failure | Graceful degradation |
| Log throughput | 100–500/s | 10,000+/s |
| Memory usage | Minimal | +10 MB (buffer) |

Graceful Degradation Under Load

| Scenario | Behavior |
|---|---|
| Normal load | Logs buffered, flushed in batches every 1 s |
| High log volume | Larger batches, more efficient writes |
| Buffer filling | Metric incremented, continues operating |
| Buffer full | New entries dropped, application unaffected |
| DB connection issues | Retry with backoff, entries held in buffer |
| DB completely down | Buffer fills, drops entries, app continues |
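
The retry-with-backoff and entries-held-in-buffer behaviors could look like the sketch below. The function name `flush_once` and its parameters are illustrative; the real batch writer would run inside the background worker and issue a bulk INSERT on the dedicated connection.

```python
import queue
import time
from typing import Any, Callable, Dict, List


def flush_once(
    q: "queue.Queue[Dict[str, Any]]",
    write_batch: Callable[[List[Dict[str, Any]]], None],
    batch_size: int = 100,
    max_retries: int = 3,
    base_backoff: float = 0.1,
) -> int:
    """Drain up to batch_size entries and write them in one batch.

    Retries with exponential backoff; on final failure the batch is
    re-queued so entries are held in the buffer (and shed later if it
    fills). Returns the number of entries written.
    """
    batch: List[Dict[str, Any]] = []
    while len(batch) < batch_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    if not batch:
        return 0
    for attempt in range(max_retries):
        try:
            write_batch(batch)  # e.g. one bulk INSERT on the dedicated connection
            return len(batch)
        except Exception:
            time.sleep(base_backoff * (2 ** attempt))  # exponential backoff
    for entry in batch:  # DB still down: hold entries in the buffer
        try:
            q.put_nowait(entry)
        except queue.Full:
            break  # buffer full: remaining entries are shed
    return 0
```

Note that a failed batch goes back into the queue rather than blocking the worker, so a slow database degrades logging throughput instead of application latency.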

Configuration

# Use async buffered writes (recommended for production)
LOG_BUFFER_ASYNC_ENABLED=true

# Maximum log entries to buffer before dropping oldest
LOG_BUFFER_MAX_ENTRIES=10000

# Batch size for database writes (reduces round-trips)
LOG_BUFFER_BATCH_SIZE=100

# Maximum seconds between flushes
LOG_BUFFER_FLUSH_INTERVAL=1.0

# Dedicated connection pool size for log writes
LOG_BUFFER_DEDICATED_POOL_SIZE=2
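
A sketch of how these variables might be parsed; this is a stand-in for the real settings in `mcpgateway/config.py`, and the class name and structure here are illustrative only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LogBufferSettings:
    """Illustrative settings holder for the async log buffer."""

    enabled: bool
    max_entries: int
    batch_size: int
    flush_interval: float
    dedicated_pool_size: int

    @classmethod
    def from_env(cls) -> "LogBufferSettings":
        # Defaults match the documented values above.
        return cls(
            enabled=os.getenv("LOG_BUFFER_ASYNC_ENABLED", "false").lower() == "true",
            max_entries=int(os.getenv("LOG_BUFFER_MAX_ENTRIES", "10000")),
            batch_size=int(os.getenv("LOG_BUFFER_BATCH_SIZE", "100")),
            flush_interval=float(os.getenv("LOG_BUFFER_FLUSH_INTERVAL", "1.0")),
            dedicated_pool_size=int(os.getenv("LOG_BUFFER_DEDICATED_POOL_SIZE", "2")),
        )
```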

Prometheus Metrics

log_buffer_size          # Current entries in buffer
log_buffer_dropped_total # Entries dropped due to full buffer
log_buffer_flushed_total # Entries successfully flushed
log_buffer_flush_duration_seconds # Flush latency histogram
log_buffer_flush_errors_total # Flush errors

Files to Create

| File | Purpose |
|---|---|
| `mcpgateway/services/async_log_buffer.py` | Async buffered log writer |
| `tests/unit/mcpgateway/services/test_async_log_buffer.py` | Unit tests |
| `tests/integration/test_async_logging.py` | Integration tests |

Files to Modify

| File | Changes |
|---|---|
| `mcpgateway/services/structured_logger.py` | Use async buffer in `_persist_to_database()` |
| `mcpgateway/config.py` | Add buffer configuration settings |
| `mcpgateway/main.py` | Initialize/shut down async buffer |
| `.env.example` | Document new environment variables |
| `docker-compose.yml` | Add default configuration |
| `docs/docs/manage/logging.md` | Update logging architecture docs |

Acceptance Criteria

  • AsyncLogBuffer class implemented with bounded queue
  • Non-blocking put() method (<1µs latency)
  • Background flush thread with batch writes
  • Dedicated connection pool (separate from app)
  • Retry logic with exponential backoff
  • Graceful shedding when buffer full
  • Prometheus metrics for buffer health
  • Configuration via environment variables
  • Startup/shutdown lifecycle integration
  • Unit tests with >90% coverage
  • Integration test under load
  • Documentation updated
  • Load test confirms no feedback loop
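
The background-flush and startup/shutdown lifecycle criteria could be covered by a worker along these lines. The class name `BufferWorker` and its method names are assumptions; the point is that `stop()` drains remaining entries before exit so logs are not lost on graceful shutdown.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


class BufferWorker:
    """Sketch of the background flush thread's start/stop lifecycle."""

    def __init__(
        self,
        q: "queue.Queue[Dict[str, Any]]",
        write_batch: Callable[[List[Dict[str, Any]]], None],
        batch_size: int = 100,
        flush_interval: float = 1.0,
    ) -> None:
        self._q = q
        self._write_batch = write_batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        """Signal shutdown, then wait for the final drain to finish."""
        self._stop_event.set()
        self._thread.join(timeout)

    def _drain(self) -> int:
        batch: List[Dict[str, Any]] = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._write_batch(batch)
        return len(batch)

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self._stop_event.wait(self._flush_interval)  # flush at most every interval
            self._drain()
        while self._drain():  # final drain on shutdown: flush everything left
            pass
```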

Labels

SHOULD, P2 (important but not vital; high-value items that are not crucial for the immediate release), database, enhancement (new feature or request), performance (performance-related items)
