Enhance the audit trail system with configurable logging levels, retention policies, buffered writes, and storage optimizations to balance compliance requirements with performance. Enable production deployments to handle high load without database exhaustion.
Kill Switch Insufficient: AUDIT_TRAIL_ENABLED=false is too coarse; need granular control
Compliance Requirements: Organizations need audit trails but can't sacrifice performance
Storage Costs: Unbounded table growth increases backup times and storage costs
📖 User Stories
US-1: Operator - Configure Audit Logging Level
As a Platform Operator
I want to configure what operations get logged
So that I can reduce write volume while maintaining compliance
Acceptance Criteria:
Scenario: Log writes only (recommended)
  Given AUDIT_TRAIL_LEVEL=writes_only
  When a user creates a tool (CREATE operation)
  Then an audit record should be created
  When a user lists tools (READ operation)
  Then NO audit record should be created

Scenario: Log mutations only
  Given AUDIT_TRAIL_LEVEL=mutations_only
  When a user updates a server (UPDATE operation)
  Then an audit record should be created
  When a user creates a server (CREATE operation)
  Then NO audit record should be created

Scenario: Log all operations
  Given AUDIT_TRAIL_LEVEL=all
  When any CRUD operation occurs
  Then an audit record should be created
Technical Requirements:
Add AUDIT_TRAIL_LEVEL config with values: all, writes_only, mutations_only, deletes_only, failures_only
Default to writes_only for new deployments
Check level before creating audit record
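The level check described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `AuditLevel`, `should_log`, and the mapping from each level to operation types are hypothetical. In particular, it assumes `mutations_only` covers UPDATE and DELETE (the scenarios only confirm that UPDATE is logged and CREATE is not) and that `failures_only` records any failed operation regardless of type.

```python
from enum import Enum


class AuditLevel(str, Enum):
    """Values listed for AUDIT_TRAIL_LEVEL."""
    ALL = "all"
    WRITES_ONLY = "writes_only"
    MUTATIONS_ONLY = "mutations_only"
    DELETES_ONLY = "deletes_only"
    FAILURES_ONLY = "failures_only"


# Assumed mapping from level to the operation types it records.
_LEVEL_OPERATIONS = {
    AuditLevel.ALL: {"CREATE", "READ", "UPDATE", "DELETE"},
    AuditLevel.WRITES_ONLY: {"CREATE", "UPDATE", "DELETE"},
    AuditLevel.MUTATIONS_ONLY: {"UPDATE", "DELETE"},  # assumption: DELETE counts as a mutation
    AuditLevel.DELETES_ONLY: {"DELETE"},
}


def should_log(level: AuditLevel, operation: str, success: bool = True) -> bool:
    """Return True if an audit record should be created for this operation."""
    if level is AuditLevel.FAILURES_ONLY:
        # Assumption: failed operations are logged whatever their type.
        return not success
    return operation.upper() in _LEVEL_OPERATIONS[level]
```

Calling `should_log` before building the audit record keeps the skipped path cheap, since no record object is created for filtered operations.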
US-2: Operator - Automatic Retention Cleanup
As a Platform Operator
I want old audit records automatically deleted
So that the table doesn't grow unbounded
Acceptance Criteria:
Scenario: Retention policy enforced
  Given AUDIT_TRAIL_RETENTION_DAYS=90
  And AUDIT_TRAIL_CLEANUP_ENABLED=true
  And audit records older than 90 days exist
  When the cleanup job runs
  Then records older than 90 days should be deleted
  And records newer than 90 days should be preserved

Scenario: Batch deletion
  Given AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000
  And 50000 records need deletion
  When the cleanup job runs
  Then records should be deleted in batches of 10000
  And the database should not lock for extended periods
Technical Requirements:
Add AUDIT_TRAIL_RETENTION_DAYS config (default: 90)
Add AUDIT_TRAIL_CLEANUP_ENABLED config (default: true)
Add AUDIT_TRAIL_CLEANUP_BATCH_SIZE config (default: 10000)
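The batch-deletion scenario might look something like the sketch below. It uses an in-process SQLite database purely so the example is runnable; the production table lives in PostgreSQL and its real schema is not shown in this issue. The function name `cleanup_audit_trails` and the columns `id` and `timestamp` (stored here as Unix epoch seconds) are assumptions for illustration.

```python
import sqlite3
import time


def cleanup_audit_trails(conn: sqlite3.Connection, retention_days: int, batch_size: int) -> int:
    """Delete rows older than the retention window in batches; return rows deleted."""
    cutoff = time.time() - retention_days * 86400  # assumed: timestamp column holds epoch seconds
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM audit_trails WHERE id IN ("
            "SELECT id FROM audit_trails WHERE timestamp < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit after each batch so locks are held only briefly
        total += cur.rowcount
        if cur.rowcount < batch_size:
            # A short batch means no more expired rows remain.
            return total
```

Committing per batch is the point of the design: each transaction touches at most `batch_size` rows, so other writers are not blocked for the duration of the whole cleanup.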
US-3: Developer - Non-Blocking Audit Writes
As a Developer concerned about latency
I want audit writes to not block request handling
So that audit overhead doesn't impact response times
Acceptance Criteria:
Scenario: Async audit write
  Given AUDIT_TRAIL_ASYNC=true
  When a tool is created
  Then the API should return immediately
  And the audit record should be written in background
  And the audit write should not add to response latency

Scenario: Buffered audit writes
  Given AUDIT_TRAIL_BUFFER_ENABLED=true
  And AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30
  When 50 audit events occur in 10 seconds
  Then events should be buffered in memory
  And written to DB in a single batch after 30 seconds
Technical Requirements:
Add AUDIT_TRAIL_ASYNC config (default: true)
Add buffered write support similar to MetricsBufferService
Write batches to database periodically
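A buffered writer along these lines could be sketched as below. This is not the `MetricsBufferService` code, which is not shown in this issue; the class name `AuditBuffer`, the `sink` callback that performs the batch write, and the injectable `clock` are all hypothetical. To stay short, the sketch only checks the flush interval when an event is added; a real service would presumably also run a background timer so that a quiet buffer still flushes on schedule.

```python
import threading
import time
from typing import Callable, Dict, List

Event = Dict[str, str]


class AuditBuffer:
    """In-memory buffer that hands events to `sink` in batches (illustrative sketch)."""

    def __init__(self, sink: Callable[[List[Event]], None],
                 max_size: int = 500, flush_interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink
        self._max_size = max_size
        self._flush_interval = flush_interval
        self._clock = clock
        self._lock = threading.Lock()
        self._events: List[Event] = []
        self._last_flush = clock()

    def add(self, event: Event) -> None:
        """Buffer an event; flush if the buffer is full or the interval has elapsed."""
        with self._lock:
            self._events.append(event)
            due = (len(self._events) >= self._max_size
                   or self._clock() - self._last_flush >= self._flush_interval)
            batch = self._take() if due else []
        if batch:
            self._sink(batch)  # write outside the lock so producers are not blocked

    def flush(self) -> None:
        """Write any pending events, e.g. during graceful shutdown."""
        with self._lock:
            batch = self._take()
        if batch:
            self._sink(batch)

    def _take(self) -> List[Event]:
        # Caller must hold the lock: swap out the pending list and reset the timer.
        batch, self._events = self._events, []
        self._last_flush = self._clock()
        return batch
```

Calling `flush()` from a shutdown hook covers the case where events are still pending when the process stops.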
US-4: Security Admin - Exclude Specific Actions
As a Security Administrator
I want to exclude noisy actions from audit logging
So that I can focus on meaningful events
Acceptance Criteria:
Scenario: Exclude specific actions
  Given AUDIT_TRAIL_EXCLUDE_ACTIONS=health_check,metrics_read
  When a health check endpoint is called
  Then NO audit record should be created
  When a tool is created
  Then an audit record SHOULD be created

Scenario: Exclude specific resources
  Given AUDIT_TRAIL_EXCLUDE_RESOURCES=metric,health
  When metrics are queried
  Then NO audit record should be created
Technical Requirements:
Add AUDIT_TRAIL_EXCLUDE_ACTIONS config (comma-separated)
Add AUDIT_TRAIL_EXCLUDE_RESOURCES config (comma-separated)
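Exclusion checking for the two comma-separated settings could be sketched like this. The helper names `parse_csv_setting` and `is_excluded` are hypothetical, and case-insensitive matching is an assumption; the issue does not say how action and resource names are compared.

```python
from typing import FrozenSet, Mapping


def parse_csv_setting(value: str) -> FrozenSet[str]:
    """Split a comma-separated setting into a set of lowercase names."""
    return frozenset(item.strip().lower() for item in value.split(",") if item.strip())


def is_excluded(action: str, resource: str, env: Mapping[str, str]) -> bool:
    """Return True if the action or the resource type is on an exclusion list."""
    actions = parse_csv_setting(env.get("AUDIT_TRAIL_EXCLUDE_ACTIONS", ""))
    resources = parse_csv_setting(env.get("AUDIT_TRAIL_EXCLUDE_RESOURCES", ""))
    return action.lower() in actions or resource.lower() in resources
```

In a running service the two sets would more likely be parsed once at startup rather than on every call; the per-call parsing here just keeps the sketch self-contained.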
🏗 Architecture
Audit Trail Flow with Buffering
sequenceDiagram
    participant API
    participant AuditService
    participant Buffer
    participant DB
    API->>AuditService: log_event(action, resource)
    AuditService->>AuditService: Check level & exclusions
    alt Async + Buffered
        AuditService->>Buffer: Add to buffer
        Note over Buffer: Accumulates events
        Buffer->>DB: Batch INSERT (every 30s or 500 events)
    else Sync (legacy)
        AuditService->>DB: INSERT single record
    end
Configuration Hierarchy
flowchart TD
    A[AUDIT_TRAIL_ENABLED] -->|false| Z[No logging]
    A -->|true| B[Check AUDIT_TRAIL_LEVEL]
    B --> C{Operation Type}
    C -->|matches level| D[Check Exclusions]
    C -->|doesn't match| Z
    D -->|excluded| Z
    D -->|not excluded| E[Create Audit Event]
    E --> F{AUDIT_TRAIL_ASYNC?}
    F -->|true| G[Add to Buffer]
    F -->|false| H[Write to DB]
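Read as code, the configuration hierarchy reduces to a short chain of early returns. The sketch below follows the flowchart as drawn; the function name `route_audit_event` and its string return values are hypothetical, and the level and exclusion results are passed in as booleans rather than computed here.

```python
def route_audit_event(enabled: bool, level_matches: bool,
                      excluded: bool, async_writes: bool) -> str:
    """Mirror the flowchart: return 'skip', 'buffer', or 'db'."""
    if not enabled:        # AUDIT_TRAIL_ENABLED=false: kill switch wins
        return "skip"
    if not level_matches:  # operation type not covered by AUDIT_TRAIL_LEVEL
        return "skip"
    if excluded:           # action or resource is on an exclusion list
        return "skip"
    # AUDIT_TRAIL_ASYNC decides between the buffer and a direct write.
    return "buffer" if async_writes else "db"
```

Ordering the checks this way means the cheapest test (the kill switch) runs first and an audit event is only created once every filter has passed.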
📋 Implementation Tasks
Phase 1: Configurable Logging Level
Add AUDIT_TRAIL_LEVEL setting with enum values
Implement level checking in audit service
Add level to audit record for filtering
Update existing callers to include operation type
Phase 2: Async/Buffered Writes
Add AUDIT_TRAIL_ASYNC setting
Add AUDIT_TRAIL_BUFFER_ENABLED setting
Add AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL setting
Add AUDIT_TRAIL_BUFFER_MAX_SIZE setting
Implement AuditBufferService (similar to MetricsBufferService)
Add graceful shutdown to flush pending events
Phase 3: Retention Policy
Add AUDIT_TRAIL_RETENTION_DAYS setting
Add AUDIT_TRAIL_CLEANUP_ENABLED setting
Add AUDIT_TRAIL_CLEANUP_BATCH_SIZE setting
Implement background cleanup task
Add cleanup metrics (records deleted, duration)
Phase 4: Exclusions
Add AUDIT_TRAIL_EXCLUDE_ACTIONS setting
Add AUDIT_TRAIL_EXCLUDE_RESOURCES setting
Implement exclusion checking in audit service
Phase 5: Sampling (Optional)
Add AUDIT_TRAIL_SAMPLE_RATE setting
Add AUDIT_TRAIL_SAMPLE_READS setting
Implement probabilistic sampling for high-volume events
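Probabilistic sampling could be as small as the sketch below. The issue does not define the semantics of the two sampling settings, so this is a guess: it assumes `AUDIT_TRAIL_SAMPLE_READS=true` means READ operations are kept with probability `AUDIT_TRAIL_SAMPLE_RATE`, while all other operations are always kept. The function name `keep_event` is hypothetical.

```python
import random
from typing import Optional


def keep_event(operation: str, sample_rate: float, sample_reads: bool,
               rng: Optional[random.Random] = None) -> bool:
    """Return True if the event should be recorded under the assumed sampling rules."""
    if not sample_reads or operation.upper() != "READ":
        # Assumption: only READ operations are ever sampled.
        return True
    draw = rng.random() if rng is not None else random.random()
    return draw < sample_rate  # draw is in [0.0, 1.0)
```

Passing a seeded `random.Random` makes the behaviour reproducible in tests, while production code can rely on the module-level generator.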
Phase 6: Documentation
Create docs/docs/manage/audit-trails.md
Document all configuration options
Add compliance guide (SOC2, HIPAA, GDPR)
Add troubleshooting section
Phase 7: Testing
Unit tests for level filtering
Unit tests for buffer service
Unit tests for cleanup task
Integration tests for end-to-end flow
Performance tests comparing sync vs async
⚙️ Configuration Example
# Recommended Production Configuration
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=writes_only # Skip reads (90% reduction)
AUDIT_TRAIL_ASYNC=true # Non-blocking writes
AUDIT_TRAIL_BUFFER_ENABLED=true # Batch writes
AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30 # Flush every 30s
AUDIT_TRAIL_BUFFER_MAX_SIZE=500 # Or when 500 events buffered
AUDIT_TRAIL_RETENTION_DAYS=90 # Keep 3 months
AUDIT_TRAIL_CLEANUP_ENABLED=true # Auto-cleanup old records
AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000 # Delete in batches

# Load Testing / Development
AUDIT_TRAIL_ENABLED=false
# High-Security Compliance
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=all # Log everything
AUDIT_TRAIL_RETENTION_DAYS=365 # Keep 1 year
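One way these variables might be read into typed settings is sketched below. `AuditTrailSettings` and `_as_bool` are hypothetical names, not the project's settings class. The defaults for level, async, retention, and cleanup follow the values stated in this issue; the fallbacks for `AUDIT_TRAIL_ENABLED` and the three buffer settings are assumptions taken from the recommended production example.

```python
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AuditTrailSettings:
    enabled: bool
    level: str
    async_writes: bool
    buffer_enabled: bool
    buffer_flush_interval: int
    buffer_max_size: int
    retention_days: int
    cleanup_enabled: bool
    cleanup_batch_size: int

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "AuditTrailSettings":
        get = env.get
        return cls(
            enabled=_as_bool(get("AUDIT_TRAIL_ENABLED", "true")),  # assumed default
            level=get("AUDIT_TRAIL_LEVEL", "writes_only"),
            async_writes=_as_bool(get("AUDIT_TRAIL_ASYNC", "true")),
            buffer_enabled=_as_bool(get("AUDIT_TRAIL_BUFFER_ENABLED", "true")),  # assumed default
            buffer_flush_interval=int(get("AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL", "30")),  # assumed default
            buffer_max_size=int(get("AUDIT_TRAIL_BUFFER_MAX_SIZE", "500")),  # assumed default
            retention_days=int(get("AUDIT_TRAIL_RETENTION_DAYS", "90")),
            cleanup_enabled=_as_bool(get("AUDIT_TRAIL_CLEANUP_ENABLED", "true")),
            cleanup_batch_size=int(get("AUDIT_TRAIL_CLEANUP_BATCH_SIZE", "10000")),
        )
```

With the standard library this would be called as `AuditTrailSettings.from_env(os.environ)`; taking a plain mapping keeps the loader easy to test.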
✅ Success Criteria
AUDIT_TRAIL_LEVEL=writes_only reduces writes by ~90%
Async writes eliminate audit latency from request path
Buffered writes reduce DB round-trips by ~95%
Retention cleanup prevents unbounded table growth
Exclusions allow filtering noisy events
Load test with 2000 users passes without DB exhaustion
⚡ Feature: Audit Trail Performance & Configuration Enhancements
Why Now?
audit_trails accumulated 995,412 rows (788 MB) in 7 hours, causing PostgreSQL memory exhaustion
🏁 Definition of Done
make verify
.env.example
📝 Additional Notes
Implementation Priority
writes_only)
Performance Expectations
level=writes_only
async=true
buffer=true
🔗 Related Issues
AUDIT_TRAIL_ENABLED kill switch