[PERFORMANCE]: Audit Trail Performance & Configuration Enhancements #1745

@crivetimihai

⚡ Feature: Audit Trail Performance & Configuration Enhancements

Goal

Enhance the audit trail system with configurable logging levels, retention policies, buffered writes, and storage optimizations to balance compliance requirements with performance. Enable production deployments to handle high load without database exhaustion.

Why Now?

  1. Load Testing Results: During a load test with 2,000 concurrent users, the audit_trails table accumulated 995,412 rows (788 MB) in 7 hours, causing PostgreSQL memory exhaustion
  2. Kill Switch Insufficient: AUDIT_TRAIL_ENABLED=false is too coarse; operators need granular control
  3. Compliance Requirements: Organizations need audit trails but can't sacrifice performance
  4. Storage Costs: Unbounded table growth increases backup times and storage costs

📖 User Stories

US-1: Operator - Configure Audit Logging Level

As a Platform Operator
I want to configure what operations get logged
So that I can reduce write volume while maintaining compliance

Acceptance Criteria:

Scenario: Log writes only (recommended)
  Given AUDIT_TRAIL_LEVEL=writes_only
  When a user creates a tool (CREATE operation)
  Then an audit record should be created
  When a user lists tools (READ operation)
  Then NO audit record should be created

Scenario: Log mutations only
  Given AUDIT_TRAIL_LEVEL=mutations_only
  When a user updates a server (UPDATE operation)
  Then an audit record should be created
  When a user creates a server (CREATE operation)
  Then NO audit record should be created

Scenario: Log all operations
  Given AUDIT_TRAIL_LEVEL=all
  When any CRUD operation occurs
  Then an audit record should be created

Technical Requirements:

  • Add AUDIT_TRAIL_LEVEL config with values: all, writes_only, mutations_only, deletes_only, failures_only
  • Default to writes_only for new deployments
  • Check level before creating audit record
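
The level check described above could be sketched as follows. This is a minimal illustration, not the project's implementation: the names `AuditTrailLevel` and `should_audit` are hypothetical, and the mapping of each level to operations is an assumption. In particular, the issue only shows that mutations_only records UPDATE and skips CREATE; including DELETE under mutations_only is a guess.

```python
from enum import Enum


class AuditTrailLevel(str, Enum):
    """Values proposed for AUDIT_TRAIL_LEVEL."""
    ALL = "all"
    WRITES_ONLY = "writes_only"
    MUTATIONS_ONLY = "mutations_only"
    DELETES_ONLY = "deletes_only"
    FAILURES_ONLY = "failures_only"


# Assumed mapping from level to the operation types it records.
_OPERATIONS_BY_LEVEL = {
    AuditTrailLevel.ALL: {"CREATE", "READ", "UPDATE", "DELETE"},
    AuditTrailLevel.WRITES_ONLY: {"CREATE", "UPDATE", "DELETE"},
    AuditTrailLevel.MUTATIONS_ONLY: {"UPDATE", "DELETE"},  # DELETE is an assumption
    AuditTrailLevel.DELETES_ONLY: {"DELETE"},
}


def should_audit(level: AuditTrailLevel, operation: str, success: bool = True) -> bool:
    """Return True if an event with this operation type should be recorded."""
    if level is AuditTrailLevel.FAILURES_ONLY:
        # Assumed semantics: record any operation that failed, regardless of type.
        return not success
    return operation.upper() in _OPERATIONS_BY_LEVEL[level]
```

Calling this before the audit record is built keeps filtered events from allocating a record or touching the database.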

US-2: Operator - Automatic Retention Cleanup

As a Platform Operator
I want old audit records automatically deleted
So that the table doesn't grow unbounded

Acceptance Criteria:

Scenario: Retention policy enforced
  Given AUDIT_TRAIL_RETENTION_DAYS=90
  And AUDIT_TRAIL_CLEANUP_ENABLED=true
  And audit records older than 90 days exist
  When the cleanup job runs
  Then records older than 90 days should be deleted
  And records newer than 90 days should be preserved

Scenario: Batch deletion
  Given AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000
  And 50000 records need deletion
  When the cleanup job runs
  Then records should be deleted in batches of 10000
  And the database should not be locked for extended periods

Technical Requirements:

  • Add AUDIT_TRAIL_RETENTION_DAYS config (default: 90)
  • Add AUDIT_TRAIL_CLEANUP_ENABLED config (default: true)
  • Add AUDIT_TRAIL_CLEANUP_BATCH_SIZE config (default: 10000)
  • Implement background cleanup task
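
A batched retention cleanup might look like the sketch below. It uses the standard-library `sqlite3` module only so the example runs on its own; the issue targets PostgreSQL, where the same `DELETE ... WHERE id IN (SELECT ... LIMIT ?)` pattern applies. The function name `cleanup_audit_trails` and the `id` and `timestamp` columns are assumptions; only the `audit_trails` table name comes from the issue.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cleanup_audit_trails(conn, retention_days=90, batch_size=10000):
    """Delete rows older than the retention window in batches.

    Returns the total number of rows deleted. Timestamps are assumed to be
    stored as UTC ISO-8601 strings, so string comparison matches time order.
    """
    cutoff = (datetime.now(timezone.utc) - timedelta(days=retention_days)).isoformat()
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM audit_trails WHERE id IN ("
            "SELECT id FROM audit_trails WHERE timestamp < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # Commit per batch so locks are held only briefly.
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total  # Last (partial or empty) batch: nothing left to delete.
```

A background task could call this on a schedule and report the returned count and elapsed time as cleanup metrics.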

US-3: Developer - Non-Blocking Audit Writes

As a Developer concerned about latency
I want audit writes to not block request handling
So that audit overhead doesn't impact response times

Acceptance Criteria:

Scenario: Async audit write
  Given AUDIT_TRAIL_ASYNC=true
  When a tool is created
  Then the API should return immediately
  And the audit record should be written in background
  And the audit write should not add to response latency

Scenario: Buffered audit writes
  Given AUDIT_TRAIL_BUFFER_ENABLED=true
  And AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30
  When 50 audit events occur in 10 seconds
  Then events should be buffered in memory
  And written to DB in a single batch after 30 seconds

Technical Requirements:

  • Add AUDIT_TRAIL_ASYNC config (default: true)
  • Add buffered write support similar to MetricsBufferService
  • Write batches to database periodically
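
The buffering behavior could be sketched as below. This is not the MetricsBufferService code, which is not shown in this issue. `AuditBuffer`, `write_batch`, and `flush_if_due` are hypothetical names, and the injected `clock` exists only to make the timing testable.

```python
import threading
import time


class AuditBuffer:
    """Collects audit events in memory and hands them to a writer in batches."""

    def __init__(self, write_batch, flush_interval=30.0, max_size=500, clock=time.monotonic):
        self._write_batch = write_batch      # callable taking a list of events
        self._flush_interval = flush_interval
        self._max_size = max_size
        self._clock = clock
        self._events = []
        self._last_flush = clock()
        self._lock = threading.Lock()

    def add(self, event):
        """Buffer one event; flush immediately once max_size is reached."""
        with self._lock:
            self._events.append(event)
            full = len(self._events) >= self._max_size
        if full:
            self.flush()

    def flush_if_due(self):
        """Intended to be called periodically by a background task."""
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        """Write all pending events as one batch (also used on shutdown)."""
        with self._lock:
            batch, self._events = self._events, []
            self._last_flush = self._clock()
        if batch:
            self._write_batch(batch)
```

Swapping the list out under the lock and writing outside it keeps request threads from waiting on the database. Calling `flush()` from a shutdown hook covers the graceful-shutdown case.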

US-4: Security Admin - Exclude Specific Actions

As a Security Administrator
I want to exclude noisy actions from audit logging
So that I can focus on meaningful events

Acceptance Criteria:

Scenario: Exclude specific actions
  Given AUDIT_TRAIL_EXCLUDE_ACTIONS=health_check,metrics_read
  When a health check endpoint is called
  Then NO audit record should be created
  When a tool is created
  Then an audit record SHOULD be created

Scenario: Exclude specific resources
  Given AUDIT_TRAIL_EXCLUDE_RESOURCES=metric,health
  When metrics are queried
  Then NO audit record should be created

Technical Requirements:

  • Add AUDIT_TRAIL_EXCLUDE_ACTIONS config (comma-separated)
  • Add AUDIT_TRAIL_EXCLUDE_RESOURCES config (comma-separated)
  • Check exclusions before creating audit record
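
A minimal sketch of the exclusion check, assuming the comma-separated values are matched case-insensitively against an action name and a resource type. The helper names `parse_csv_setting` and `is_excluded` are hypothetical.

```python
def parse_csv_setting(raw: str) -> frozenset:
    """Turn 'health_check, metrics_read' into a set of normalized names."""
    return frozenset(item.strip().lower() for item in raw.split(",") if item.strip())


def is_excluded(action: str, resource_type: str,
                exclude_actions: frozenset, exclude_resources: frozenset) -> bool:
    """Return True if either the action or the resource type is excluded."""
    return action.lower() in exclude_actions or resource_type.lower() in exclude_resources


# Example values taken from the acceptance criteria above.
EXCLUDE_ACTIONS = parse_csv_setting("health_check,metrics_read")
EXCLUDE_RESOURCES = parse_csv_setting("metric,health")
```

Parsing the settings once at startup into frozensets makes the per-request check a pair of constant-time lookups.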

🏗 Architecture

Audit Trail Flow with Buffering

sequenceDiagram
    participant API
    participant AuditService
    participant Buffer
    participant DB

    API->>AuditService: log_event(action, resource)
    AuditService->>AuditService: Check level & exclusions
    
    alt Async + Buffered
        AuditService->>Buffer: Add to buffer
        Note over Buffer: Accumulates events
        Buffer->>DB: Batch INSERT (every 30s or 500 events)
    else Sync (legacy)
        AuditService->>DB: INSERT single record
    end

Configuration Hierarchy

flowchart TD
    A[AUDIT_TRAIL_ENABLED] -->|false| Z[No logging]
    A -->|true| B[Check AUDIT_TRAIL_LEVEL]
    B --> C{Operation Type}
    C -->|matches level| D[Check Exclusions]
    C -->|doesn't match| Z
    D -->|excluded| Z
    D -->|not excluded| E[Create Audit Event]
    E --> F{AUDIT_TRAIL_ASYNC?}
    F -->|true| G[Add to Buffer]
    F -->|false| H[Write to DB]

📋 Implementation Tasks

Phase 1: Configurable Logging Level

  • Add AUDIT_TRAIL_LEVEL setting with enum values
  • Implement level checking in audit service
  • Add level to audit record for filtering
  • Update existing callers to include operation type

Phase 2: Async/Buffered Writes

  • Add AUDIT_TRAIL_ASYNC setting
  • Add AUDIT_TRAIL_BUFFER_ENABLED setting
  • Add AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL setting
  • Add AUDIT_TRAIL_BUFFER_MAX_SIZE setting
  • Implement AuditBufferService (similar to MetricsBufferService)
  • Add graceful shutdown to flush pending events

Phase 3: Retention Policy

  • Add AUDIT_TRAIL_RETENTION_DAYS setting
  • Add AUDIT_TRAIL_CLEANUP_ENABLED setting
  • Add AUDIT_TRAIL_CLEANUP_BATCH_SIZE setting
  • Implement background cleanup task
  • Add cleanup metrics (records deleted, duration)

Phase 4: Exclusions

  • Add AUDIT_TRAIL_EXCLUDE_ACTIONS setting
  • Add AUDIT_TRAIL_EXCLUDE_RESOURCES setting
  • Implement exclusion checking in audit service

Phase 5: Sampling (Optional)

  • Add AUDIT_TRAIL_SAMPLE_RATE setting
  • Add AUDIT_TRAIL_SAMPLE_READS setting
  • Implement probabilistic sampling for high-volume events
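
Probabilistic sampling could be as simple as the sketch below. The issue does not define the semantics of AUDIT_TRAIL_SAMPLE_RATE or AUDIT_TRAIL_SAMPLE_READS, so this assumes the rate is a fraction between 0 and 1 and that the second setting is a boolean that applies sampling to READ operations only. `keep_event` is a hypothetical name.

```python
import random


def keep_event(operation: str, sample_rate: float, sample_reads: bool = True,
               rng=random.random) -> bool:
    """Decide whether to record an event under the assumed sampling rules.

    Non-READ operations are always kept. READ operations are kept with
    probability sample_rate when sample_reads is enabled.
    """
    if operation.upper() != "READ" or not sample_reads:
        return True
    return rng() < sample_rate
```

Keeping writes outside the sampler means sampling cannot drop events that a compliance review is likely to need.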

Phase 6: Documentation

  • Create docs/docs/manage/audit-trails.md
  • Document all configuration options
  • Add compliance guide (SOC2, HIPAA, GDPR)
  • Add troubleshooting section

Phase 7: Testing

  • Unit tests for level filtering
  • Unit tests for buffer service
  • Unit tests for cleanup task
  • Integration tests for end-to-end flow
  • Performance tests comparing sync vs async

⚙️ Configuration Example

# Recommended Production Configuration
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=writes_only           # Skip reads (90% reduction)
AUDIT_TRAIL_ASYNC=true                   # Non-blocking writes
AUDIT_TRAIL_BUFFER_ENABLED=true          # Batch writes
AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30     # Flush every 30s
AUDIT_TRAIL_BUFFER_MAX_SIZE=500          # Or when 500 events buffered
AUDIT_TRAIL_RETENTION_DAYS=90            # Keep 3 months
AUDIT_TRAIL_CLEANUP_ENABLED=true         # Auto-cleanup old records
AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000     # Delete in batches

# Load Testing / Development
AUDIT_TRAIL_ENABLED=false

# High-Security Compliance
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=all                    # Log everything
AUDIT_TRAIL_RETENTION_DAYS=365           # Keep 1 year
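
For illustration, these variables could be read into a typed settings object along the lines of the sketch below. `AuditTrailSettings` and `load_audit_settings` are hypothetical names, and the project may load settings differently. The defaults for the level, async, retention, and cleanup settings follow the requirements above; the defaults for AUDIT_TRAIL_ENABLED and the buffer settings are assumptions taken from the recommended production profile.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AuditTrailSettings:
    enabled: bool
    level: str
    async_writes: bool
    buffer_enabled: bool
    buffer_flush_interval: int  # seconds
    buffer_max_size: int        # events
    retention_days: int
    cleanup_enabled: bool
    cleanup_batch_size: int


def load_audit_settings(env: Mapping[str, str] = os.environ) -> AuditTrailSettings:
    """Build settings from an environment-like mapping."""
    return AuditTrailSettings(
        enabled=_as_bool(env.get("AUDIT_TRAIL_ENABLED", "true")),  # assumed default
        level=env.get("AUDIT_TRAIL_LEVEL", "writes_only"),
        async_writes=_as_bool(env.get("AUDIT_TRAIL_ASYNC", "true")),
        buffer_enabled=_as_bool(env.get("AUDIT_TRAIL_BUFFER_ENABLED", "true")),  # assumed
        buffer_flush_interval=int(env.get("AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL", "30")),  # assumed
        buffer_max_size=int(env.get("AUDIT_TRAIL_BUFFER_MAX_SIZE", "500")),  # assumed
        retention_days=int(env.get("AUDIT_TRAIL_RETENTION_DAYS", "90")),
        cleanup_enabled=_as_bool(env.get("AUDIT_TRAIL_CLEANUP_ENABLED", "true")),
        cleanup_batch_size=int(env.get("AUDIT_TRAIL_CLEANUP_BATCH_SIZE", "10000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes each profile above easy to exercise in unit tests.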

✅ Success Criteria

  • AUDIT_TRAIL_LEVEL=writes_only reduces writes by ~90%
  • Async writes eliminate audit latency from request path
  • Buffered writes reduce DB round-trips by ~95%
  • Retention cleanup prevents unbounded table growth
  • Exclusions allow filtering noisy events
  • Load test with 2000 users passes without DB exhaustion
  • All configurations documented in .env.example
  • Compliance documentation created

🏁 Definition of Done

  • Logging level configuration implemented
  • Async/buffered writes implemented
  • Retention cleanup implemented
  • Exclusions implemented
  • Unit tests written and passing
  • Integration tests written and passing
  • Performance tests show improvement
  • Code passes make verify
  • Documentation created
  • Settings documented in .env.example
  • PR reviewed and approved

📝 Additional Notes

Implementation Priority

| Enhancement | Effort | Impact | Priority |
|---|---|---|---|
| Logging Level (writes_only) | Low | High | P0 |
| Async Writes | Low | High | P0 |
| Exclusions | Low | Medium | P1 |
| Buffered Writes | Medium | High | P1 |
| Retention Policy | Medium | Medium | P1 |
| Sampling | Low | Medium | P2 |

Performance Expectations

| Configuration | Write Reduction | Latency Impact |
|---|---|---|
| level=writes_only | ~90% | None |
| async=true | 0% | -50 ms avg |
| buffer=true | 0% | -95% DB calls |
| Combined | ~90% | Near zero |
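
As a rough cross-check of the buffering estimate, the load-test figures from "Why Now?" can be plugged into a back-of-the-envelope calculation. This assumes events arrive at a steady rate and that the buffer flushes at 30 seconds or 500 events, whichever comes first; `estimate_round_trips` is a hypothetical helper, not a measurement.

```python
import math


def estimate_round_trips(rows: int, seconds: float,
                         flush_interval: float = 30.0, max_size: int = 500) -> int:
    """Estimate batch INSERTs needed for `rows` events spread evenly over `seconds`."""
    events_per_interval = rows / seconds * flush_interval
    # The size cap triggers first when an interval would collect more than max_size.
    batch_size = max(1.0, min(float(max_size), events_per_interval))
    return math.ceil(rows / batch_size)


ROWS = 995_412        # rows accumulated during the load test
SECONDS = 7 * 3600    # 7 hours

events_per_second = ROWS / SECONDS
round_trips = estimate_round_trips(ROWS, SECONDS)
reduction = 1 - round_trips / ROWS
print(f"{events_per_second:.1f} events/s, {round_trips} batches, {reduction:.1%} fewer round-trips")
```

At the load-test rate of about 39.5 events per second, the 500-event cap is reached before the 30-second timer, so the estimate is 1,991 batch inserts instead of 995,412 single inserts. Real traffic is burstier than this model, so the figure is an upper bound on the benefit rather than a prediction.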

🔗 Related Issues

Labels

  • SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release)
  • audit (Compliance and auditing)
  • database
  • enhancement (New feature or request)
  • performance (Performance related items)
  • security (Improves security)
