[PERFORMANCE]: Audit Trail Performance & Configuration Enhancements #1745
Description
⚡ Feature: Audit Trail Performance & Configuration Enhancements
Goal
Enhance the audit trail system with configurable logging levels, retention policies, buffered writes, and storage optimizations to balance compliance requirements with performance. Enable production deployments to handle high load without database exhaustion.
Why Now?
- Load Testing Results: During a load test with 2000 concurrent users, `audit_trails` accumulated 995,412 rows (788 MB) in 7 hours, causing PostgreSQL memory exhaustion
- Kill Switch Insufficient: `AUDIT_TRAIL_ENABLED=false` is too coarse; need granular control
- Compliance Requirements: Organizations need audit trails but can't sacrifice performance
- Storage Costs: Unbounded table growth increases backup times and storage costs
📖 User Stories
US-1: Operator - Configure Audit Logging Level
As a Platform Operator
I want to configure what operations get logged
So that I can reduce write volume while maintaining compliance
Acceptance Criteria:
Scenario: Log writes only (recommended)
Given AUDIT_TRAIL_LEVEL=writes_only
When a user creates a tool (CREATE operation)
Then an audit record should be created
When a user lists tools (READ operation)
Then NO audit record should be created
Scenario: Log mutations only
Given AUDIT_TRAIL_LEVEL=mutations_only
When a user updates a server (UPDATE operation)
Then an audit record should be created
When a user creates a server (CREATE operation)
Then NO audit record should be created
Scenario: Log all operations
Given AUDIT_TRAIL_LEVEL=all
When any CRUD operation occurs
Then an audit record should be created
Technical Requirements:
- Add `AUDIT_TRAIL_LEVEL` config with values: `all`, `writes_only`, `mutations_only`, `deletes_only`, `failures_only`
- Default to `writes_only` for new deployments
- Check level before creating audit record
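The level check could look roughly like the sketch below. It is not the project's implementation: `AuditLevel` and `should_log` are hypothetical names, and only the CREATE/READ/UPDATE cases from the scenarios above come from this issue. Treating DELETE as both a write and a mutation, and basing `failures_only` on a `success` flag, are assumptions.

```python
from enum import Enum


class AuditLevel(str, Enum):
    """Values proposed for AUDIT_TRAIL_LEVEL (class name is hypothetical)."""

    ALL = "all"
    WRITES_ONLY = "writes_only"
    MUTATIONS_ONLY = "mutations_only"
    DELETES_ONLY = "deletes_only"
    FAILURES_ONLY = "failures_only"


# Operations logged at each filtered level.
# Assumption: DELETE counts as a write and as a mutation.
_LOGGED_OPERATIONS = {
    AuditLevel.WRITES_ONLY: {"CREATE", "UPDATE", "DELETE"},
    AuditLevel.MUTATIONS_ONLY: {"UPDATE", "DELETE"},
    AuditLevel.DELETES_ONLY: {"DELETE"},
}


def should_log(level: AuditLevel, operation: str, success: bool = True) -> bool:
    """Return True if an audit record should be created for this operation."""
    if level is AuditLevel.ALL:
        return True
    if level is AuditLevel.FAILURES_ONLY:
        # Assumption: callers report whether the operation succeeded.
        return not success
    return operation.upper() in _LOGGED_OPERATIONS[level]
```

Calling `should_log(AuditLevel("writes_only"), "READ")` returns `False`, so the audit service can return before building the record at all.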
US-2: Operator - Automatic Retention Cleanup
As a Platform Operator
I want old audit records automatically deleted
So that the table doesn't grow unbounded
Acceptance Criteria:
Scenario: Retention policy enforced
Given AUDIT_TRAIL_RETENTION_DAYS=90
And AUDIT_TRAIL_CLEANUP_ENABLED=true
And audit records older than 90 days exist
When the cleanup job runs
Then records older than 90 days should be deleted
And records newer than 90 days should be preserved
Scenario: Batch deletion
Given AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000
And 50000 records need deletion
When the cleanup job runs
Then records should be deleted in batches of 10000
And database should not lock for extended periods
Technical Requirements:
- Add `AUDIT_TRAIL_RETENTION_DAYS` config (default: 90)
- Add `AUDIT_TRAIL_CLEANUP_ENABLED` config (default: true)
- Add `AUDIT_TRAIL_CLEANUP_BATCH_SIZE` config (default: 10000)
- Implement background cleanup task
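A minimal sketch of batched deletion follows. It uses the standard-library `sqlite3` module so it runs standalone; the real deployment targets PostgreSQL. The function name and the `id` and `created_at` columns are assumptions for illustration, not the actual `audit_trails` schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cleanup_old_audit_records(conn, retention_days=90, batch_size=10000, now=None):
    """Delete rows older than the retention cutoff in batches.

    Returns (rows_deleted, batches_run). Column names are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).timestamp()
    deleted = 0
    batches = 0
    while True:
        cur = conn.execute(
            "DELETE FROM audit_trails WHERE id IN ("
            "SELECT id FROM audit_trails WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cur.rowcount == 0:
            break
        deleted += cur.rowcount
        batches += 1
    return deleted, batches
```

Returning the deleted count and batch count gives the caller what it needs for the cleanup metrics listed under Phase 3.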
US-3: Developer - Non-Blocking Audit Writes
As a Developer concerned about latency
I want audit writes to not block request handling
So that audit overhead doesn't impact response times
Acceptance Criteria:
Scenario: Async audit write
Given AUDIT_TRAIL_ASYNC=true
When a tool is created
Then the API should return immediately
And the audit record should be written in background
And the audit write should not add to response latency
Scenario: Buffered audit writes
Given AUDIT_TRAIL_BUFFER_ENABLED=true
And AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30
When 50 audit events occur in 10 seconds
Then events should be buffered in memory
And written to DB in a single batch after 30 seconds
Technical Requirements:
- Add `AUDIT_TRAIL_ASYNC` config (default: true)
- Add buffered write support similar to `MetricsBufferService`
- Write batches to database periodically
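One possible shape for the buffer is sketched below. `AuditBuffer`, `write_batch`, and the injectable `clock` are hypothetical names; how `MetricsBufferService` actually works is not shown in this issue. To stay small, the sketch checks the flush interval only when an event is added. A real service would also run a periodic task and call `flush()` on shutdown.

```python
import threading
import time
from typing import Callable, Dict, List

AuditEvent = Dict[str, str]


class AuditBuffer:
    """In-memory buffer that hands events to write_batch in batches."""

    def __init__(
        self,
        write_batch: Callable[[List[AuditEvent]], None],
        max_size: int = 500,
        flush_interval: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._write_batch = write_batch
        self._max_size = max_size
        self._flush_interval = flush_interval
        self._clock = clock
        self._lock = threading.Lock()
        self._events: List[AuditEvent] = []
        self._last_flush = clock()

    def add(self, event: AuditEvent) -> None:
        """Buffer one event; flush if the size or time threshold is reached."""
        with self._lock:
            self._events.append(event)
            due = (
                len(self._events) >= self._max_size
                or self._clock() - self._last_flush >= self._flush_interval
            )
        if due:
            self.flush()

    def flush(self) -> int:
        """Write all pending events as one batch; return how many were written."""
        with self._lock:
            batch, self._events = self._events, []
            self._last_flush = self._clock()
        if batch:
            # Called outside the lock so a slow DB write does not block add().
            self._write_batch(batch)
        return len(batch)
```

Swapping the list out under the lock and writing outside it keeps request threads from waiting on the database, which is the point of US-3.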
US-4: Security Admin - Exclude Specific Actions
As a Security Administrator
I want to exclude noisy actions from audit logging
So that I can focus on meaningful events
Acceptance Criteria:
Scenario: Exclude specific actions
Given AUDIT_TRAIL_EXCLUDE_ACTIONS=health_check,metrics_read
When a health check endpoint is called
Then NO audit record should be created
When a tool is created
Then an audit record SHOULD be created
Scenario: Exclude specific resources
Given AUDIT_TRAIL_EXCLUDE_RESOURCES=metric,health
When metrics are queried
Then NO audit record should be created
Technical Requirements:
- Add `AUDIT_TRAIL_EXCLUDE_ACTIONS` config (comma-separated)
- Add `AUDIT_TRAIL_EXCLUDE_RESOURCES` config (comma-separated)
- Check exclusions before creating audit record
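Parsing the comma-separated settings and checking them might be sketched as follows. Both function names are hypothetical, and the case-insensitive matching is an assumption rather than a stated requirement.

```python
from typing import FrozenSet


def parse_csv_setting(raw: str) -> FrozenSet[str]:
    """Turn 'health_check, metrics_read' into a set of normalized names."""
    return frozenset(item.strip().lower() for item in raw.split(",") if item.strip())


def is_excluded(
    action: str,
    resource: str,
    excluded_actions: FrozenSet[str],
    excluded_resources: FrozenSet[str],
) -> bool:
    """Return True if the event matches an excluded action or resource."""
    return action.lower() in excluded_actions or resource.lower() in excluded_resources
```

Parsing once at startup into frozen sets keeps the per-request check to two constant-time lookups.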
🏗 Architecture
Audit Trail Flow with Buffering
sequenceDiagram
participant API
participant AuditService
participant Buffer
participant DB
API->>AuditService: log_event(action, resource)
AuditService->>AuditService: Check level & exclusions
alt Async + Buffered
AuditService->>Buffer: Add to buffer
Note over Buffer: Accumulates events
Buffer->>DB: Batch INSERT (every 30s or 500 events)
else Sync (legacy)
AuditService->>DB: INSERT single record
end
Configuration Hierarchy
flowchart TD
A[AUDIT_TRAIL_ENABLED] -->|false| Z[No logging]
A -->|true| B[Check AUDIT_TRAIL_LEVEL]
B --> C{Operation Type}
C -->|matches level| D[Check Exclusions]
C -->|doesn't match| Z
D -->|excluded| Z
D -->|not excluded| E[Create Audit Event]
E --> F{AUDIT_TRAIL_ASYNC?}
F -->|true| G[Add to Buffer]
F -->|false| H[Write to DB]
📋 Implementation Tasks
Phase 1: Configurable Logging Level
- Add `AUDIT_TRAIL_LEVEL` setting with enum values
- Implement level checking in audit service
- Add level to audit record for filtering
- Update existing callers to include operation type
Phase 2: Async/Buffered Writes
- Add `AUDIT_TRAIL_ASYNC` setting
- Add `AUDIT_TRAIL_BUFFER_ENABLED` setting
- Add `AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL` setting
- Add `AUDIT_TRAIL_BUFFER_MAX_SIZE` setting
- Implement `AuditBufferService` (similar to `MetricsBufferService`)
- Add graceful shutdown to flush pending events
Phase 3: Retention Policy
- Add `AUDIT_TRAIL_RETENTION_DAYS` setting
- Add `AUDIT_TRAIL_CLEANUP_ENABLED` setting
- Add `AUDIT_TRAIL_CLEANUP_BATCH_SIZE` setting
- Implement background cleanup task
- Add cleanup metrics (records deleted, duration)
Phase 4: Exclusions
- Add `AUDIT_TRAIL_EXCLUDE_ACTIONS` setting
- Add `AUDIT_TRAIL_EXCLUDE_RESOURCES` setting
- Implement exclusion checking in audit service
Phase 5: Sampling (Optional)
- Add `AUDIT_TRAIL_SAMPLE_RATE` setting
- Add `AUDIT_TRAIL_SAMPLE_READS` setting
- Implement probabilistic sampling for high-volume events
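The semantics of the two sampling settings are not spelled out in this issue. The sketch below assumes `AUDIT_TRAIL_SAMPLE_RATE` is a probability between 0 and 1 and that `AUDIT_TRAIL_SAMPLE_READS` switches sampling on for READ operations only, with all other operations always logged. The function name and the injectable `rng` are hypothetical.

```python
import random
from typing import Callable


def should_sample(
    operation: str,
    sample_rate: float,
    sample_reads: bool = True,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Return True if this event should be kept.

    Assumption: only READ events are sampled; writes are always kept.
    """
    if operation.upper() != "READ" or not sample_reads:
        return True
    # rng() is uniform on [0, 1), so a read is kept with probability sample_rate.
    return rng() < sample_rate
```

Passing a fixed `rng` makes the decision deterministic in unit tests.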
Phase 6: Documentation
- Create `docs/docs/manage/audit-trails.md`
- Document all configuration options
- Add compliance guide (SOC2, HIPAA, GDPR)
- Add troubleshooting section
Phase 7: Testing
- Unit tests for level filtering
- Unit tests for buffer service
- Unit tests for cleanup task
- Integration tests for end-to-end flow
- Performance tests comparing sync vs async
⚙️ Configuration Example
# Recommended Production Configuration
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=writes_only # Skip reads (90% reduction)
AUDIT_TRAIL_ASYNC=true # Non-blocking writes
AUDIT_TRAIL_BUFFER_ENABLED=true # Batch writes
AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30 # Flush every 30s
AUDIT_TRAIL_BUFFER_MAX_SIZE=500 # Or when 500 events buffered
AUDIT_TRAIL_RETENTION_DAYS=90 # Keep 3 months
AUDIT_TRAIL_CLEANUP_ENABLED=true # Auto-cleanup old records
AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000 # Delete in batches
# Load Testing / Development
AUDIT_TRAIL_ENABLED=false
# High-Security Compliance
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=all # Log everything
AUDIT_TRAIL_RETENTION_DAYS=365 # Keep 1 year
✅ Success Criteria
- `AUDIT_TRAIL_LEVEL=writes_only` reduces writes by ~90%
- Async writes eliminate audit latency from request path
- Buffered writes reduce DB round-trips by ~95%
- Retention cleanup prevents unbounded table growth
- Exclusions allow filtering noisy events
- Load test with 2000 users passes without DB exhaustion
- All configurations documented in `.env.example`
- Compliance documentation created
🏁 Definition of Done
- Logging level configuration implemented
- Async/buffered writes implemented
- Retention cleanup implemented
- Exclusions implemented
- Unit tests written and passing
- Integration tests written and passing
- Performance tests show improvement
- Code passes `make verify`
- Documentation created
- Settings documented in `.env.example`
- PR reviewed and approved
📝 Additional Notes
Implementation Priority
| Enhancement | Effort | Impact | Priority |
|---|---|---|---|
| Logging Level (`writes_only`) | Low | High | P0 |
| Async Writes | Low | High | P0 |
| Exclusions | Low | Medium | P1 |
| Buffered Writes | Medium | High | P1 |
| Retention Policy | Medium | Medium | P1 |
| Sampling | Low | Medium | P2 |
Performance Expectations
| Configuration | Write Reduction | Latency Impact |
|---|---|---|
| `level=writes_only` | ~90% | None |
| `async=true` | 0% | -50ms avg |
| `buffer=true` | 0% | -95% DB calls |
| Combined | ~90% | Near zero |
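As a rough sanity check, the ~90% write reduction can be applied to the load-test figures from "Why Now?". This is back-of-the-envelope arithmetic, not a measurement; the variable names are illustrative, and the 90% figure is the estimate from the table above.

```python
# Figures reported in the load test (2000 concurrent users).
rows_written = 995_412
duration_seconds = 7 * 3600

# Average audit insert rate observed with every operation logged.
rows_per_second = rows_written / duration_seconds

# Projected row count for the same test if ~90% of writes are skipped.
projected_rows = round(rows_written * 0.10)

print(f"{rows_per_second:.1f} rows/s, about {projected_rows} rows at writes_only")
```

Under these assumptions the same 7-hour test would write roughly 100,000 rows instead of about a million.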
🔗 Related Issues
- Issue "Add AUDIT_TRAIL_ENABLED flag to disable audit trail logging for performance" #1743 - Added `AUDIT_TRAIL_ENABLED` kill switch
- Load testing results documentation
- Compliance requirements documentation