[PERFORMANCE]: Audit Trail Performance & Configuration Enhancements #1745

# ⚡ Feature: Audit Trail Performance & Configuration Enhancements

## Goal

Enhance the audit trail system with configurable logging levels, retention policies, buffered writes, and storage optimizations to balance compliance requirements with performance. Enable production deployments to handle high load without database exhaustion.

## Why Now?

1. **Load Testing Results**: During a load test with 2,000 concurrent users, `audit_trails` accumulated 995,412 rows (788 MB) in 7 hours, causing PostgreSQL memory exhaustion
2. **Kill Switch Insufficient**: `AUDIT_TRAIL_ENABLED=false` is too coarse; need granular control
3. **Compliance Requirements**: Organizations need audit trails but can't sacrifice performance
4. **Storage Costs**: Unbounded table growth increases backup times and storage costs

---

## 📖 User Stories

<details>
<summary>US-1: Operator - Configure Audit Logging Level</summary>

**As a** Platform Operator 
**I want** to configure what operations get logged 
**So that** I can reduce write volume while maintaining compliance

**Acceptance Criteria:**

```gherkin
Scenario: Log writes only (recommended)
 Given AUDIT_TRAIL_LEVEL=writes_only
 When a user creates a tool (CREATE operation)
 Then an audit record should be created
 When a user lists tools (READ operation)
 Then NO audit record should be created

Scenario: Log mutations only
 Given AUDIT_TRAIL_LEVEL=mutations_only
 When a user updates a server (UPDATE operation)
 Then an audit record should be created
 When a user creates a server (CREATE operation)
 Then NO audit record should be created

Scenario: Log all operations
 Given AUDIT_TRAIL_LEVEL=all
 When any CRUD operation occurs
 Then an audit record should be created
```

**Technical Requirements:**
- Add `AUDIT_TRAIL_LEVEL` config with values: `all`, `writes_only`, `mutations_only`, `deletes_only`, `failures_only`
- Default to `writes_only` for new deployments
- Check level before creating audit record
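
The level check could be sketched as follows. The source only fixes part of the semantics (`writes_only` logs CREATE but not READ; `mutations_only` logs UPDATE but not CREATE), so the full operation sets, the treatment of DELETE, and the meaning of `failures_only` below are assumptions, and `AuditLevel` and `should_audit` are hypothetical names.

```python
from enum import Enum


class AuditLevel(str, Enum):
    """Values accepted by AUDIT_TRAIL_LEVEL."""

    ALL = "all"
    WRITES_ONLY = "writes_only"
    MUTATIONS_ONLY = "mutations_only"
    DELETES_ONLY = "deletes_only"
    FAILURES_ONLY = "failures_only"


# Assumed mapping from level to the CRUD operations it records.
_OPERATIONS_BY_LEVEL = {
    AuditLevel.ALL: {"CREATE", "READ", "UPDATE", "DELETE"},
    AuditLevel.WRITES_ONLY: {"CREATE", "UPDATE", "DELETE"},
    AuditLevel.MUTATIONS_ONLY: {"UPDATE", "DELETE"},
    AuditLevel.DELETES_ONLY: {"DELETE"},
}


def should_audit(level: AuditLevel, operation: str, success: bool = True) -> bool:
    """Return True when an operation should produce an audit record."""
    if level is AuditLevel.FAILURES_ONLY:
        # Assumption: this level records any failed operation, whatever its type.
        return not success
    return operation.upper() in _OPERATIONS_BY_LEVEL[level]
```

Calling this before the audit record is built keeps filtered-out operations from paying any serialization or database cost.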

</details>

<details>
<summary>US-2: Operator - Automatic Retention Cleanup</summary>

**As a** Platform Operator 
**I want** old audit records automatically deleted 
**So that** the table doesn't grow unbounded

**Acceptance Criteria:**

```gherkin
Scenario: Retention policy enforced
 Given AUDIT_TRAIL_RETENTION_DAYS=90
 And AUDIT_TRAIL_CLEANUP_ENABLED=true
 And audit records older than 90 days exist
 When the cleanup job runs
 Then records older than 90 days should be deleted
 And records newer than 90 days should be preserved

Scenario: Batch deletion
 Given AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000
 And 50000 records need deletion
 When the cleanup job runs
 Then records should be deleted in batches of 10000
 And database should not lock for extended periods
```

**Technical Requirements:**
- Add `AUDIT_TRAIL_RETENTION_DAYS` config (default: 90)
- Add `AUDIT_TRAIL_CLEANUP_ENABLED` config (default: true)
- Add `AUDIT_TRAIL_CLEANUP_BATCH_SIZE` config (default: 10000)
- Implement background cleanup task
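
A minimal sketch of the batched deletion, assuming a `created_at` timestamp column and an integer `id` primary key on `audit_trails` (both hypothetical). It uses the standard-library `sqlite3` module only so the example is self-contained; the production target is PostgreSQL, where the same `DELETE ... WHERE id IN (SELECT ... LIMIT n)` pattern applies.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cleanup_audit_trails(conn, retention_days=90, batch_size=10000, now=None):
    """Delete expired audit rows in batches; return (rows_deleted, batches)."""
    now = now or datetime.now(timezone.utc)
    # ISO-8601 strings in one timezone sort chronologically, so a plain
    # string comparison is enough for this sketch.
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    deleted = 0
    batches = 0
    while True:
        cursor = conn.execute(
            "DELETE FROM audit_trails WHERE id IN ("
            "SELECT id FROM audit_trails WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cursor.rowcount == 0:
            break
        deleted += cursor.rowcount
        batches += 1
    return deleted, batches
```

The returned counts map directly onto the cleanup metrics (records deleted, number of batches) that a background task could report.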

</details>

<details>
<summary>US-3: Developer - Non-Blocking Audit Writes</summary>

**As a** Developer concerned about latency 
**I want** audit writes to not block request handling 
**So that** audit overhead doesn't impact response times

**Acceptance Criteria:**

```gherkin
Scenario: Async audit write
 Given AUDIT_TRAIL_ASYNC=true
 When a tool is created
 Then the API should return immediately
 And the audit record should be written in background
 And the audit write should not add to response latency

Scenario: Buffered audit writes
 Given AUDIT_TRAIL_BUFFER_ENABLED=true
 And AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30
 When 50 audit events occur in 10 seconds
 Then events should be buffered in memory
 And written to DB in a single batch after 30 seconds
```

**Technical Requirements:**
- Add `AUDIT_TRAIL_ASYNC` config (default: true)
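- Add buffered write support (`AUDIT_TRAIL_BUFFER_ENABLED`, `AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL`, `AUDIT_TRAIL_BUFFER_MAX_SIZE`), similar to `MetricsBufferService`
- Write batches to the database when the flush interval elapses or the buffer reaches its maximum size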
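
One way the buffering could look, as a sketch only: `AuditBuffer` and its `write_batch` callback are hypothetical names and do not reflect the actual `MetricsBufferService` API. Events accumulate in memory and are handed to `write_batch` as one list when either the size limit or the flush interval is reached.

```python
import threading
import time
from typing import Callable


class AuditBuffer:
    """Collects audit events and writes them in batches."""

    def __init__(
        self,
        write_batch: Callable[[list], None],
        flush_interval: float = 30.0,
        max_size: int = 500,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._write_batch = write_batch  # e.g. one multi-row INSERT
        self._flush_interval = flush_interval
        self._max_size = max_size
        self._clock = clock
        self._events: list = []
        self._last_flush = clock()
        self._lock = threading.Lock()

    def add(self, event: dict) -> None:
        """Buffer one event and flush if a size or time limit is reached."""
        with self._lock:
            self._events.append(event)
            due = (
                len(self._events) >= self._max_size
                or self._clock() - self._last_flush >= self._flush_interval
            )
        if due:
            self.flush()

    def flush(self) -> int:
        """Write all pending events as one batch; return how many were written."""
        with self._lock:
            batch, self._events = self._events, []
            self._last_flush = self._clock()
        if batch:
            self._write_batch(batch)
        return len(batch)
```

In this sketch the interval is only checked when an event arrives, so a real service would also need a periodic timer and a final `flush()` on shutdown to avoid holding events during idle periods.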

</details>

<details>
<summary>US-4: Security Admin - Exclude Specific Actions</summary>

**As a** Security Administrator 
**I want** to exclude noisy actions from audit logging 
**So that** I can focus on meaningful events

**Acceptance Criteria:**

```gherkin
Scenario: Exclude specific actions
 Given AUDIT_TRAIL_EXCLUDE_ACTIONS=health_check,metrics_read
 When a health check endpoint is called
 Then NO audit record should be created
 When a tool is created
 Then an audit record SHOULD be created

Scenario: Exclude specific resources
 Given AUDIT_TRAIL_EXCLUDE_RESOURCES=metric,health
 When metrics are queried
 Then NO audit record should be created
```

**Technical Requirements:**
- Add `AUDIT_TRAIL_EXCLUDE_ACTIONS` config (comma-separated)
- Add `AUDIT_TRAIL_EXCLUDE_RESOURCES` config (comma-separated)
- Check exclusions before creating audit record
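
The exclusion check could be as small as the sketch below. The helper names are hypothetical, and treating values as case-insensitive and whitespace-tolerant is an assumption the source does not state.

```python
def parse_csv_setting(raw: str) -> frozenset:
    """Turn 'health_check, metrics_read' into a normalized set of names."""
    return frozenset(item.strip().lower() for item in raw.split(",") if item.strip())


def is_excluded(
    action: str,
    resource: str,
    exclude_actions: frozenset,
    exclude_resources: frozenset,
) -> bool:
    """Return True when either the action or the resource type is excluded."""
    return action.lower() in exclude_actions or resource.lower() in exclude_resources
```

Parsing the two settings once at startup and reusing the resulting sets keeps the per-request check to two set lookups.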

</details>

---

## 🏗 Architecture

### Audit Trail Flow with Buffering

```mermaid
sequenceDiagram
 participant API
 participant AuditService
 participant Buffer
 participant DB

 API->>AuditService: log_event(action, resource)
 AuditService->>AuditService: Check level & exclusions
 
 alt Async + Buffered
 AuditService->>Buffer: Add to buffer
 Note over Buffer: Accumulates events
 Buffer->>DB: Batch INSERT (every 30s or 500 events)
 else Sync (legacy)
 AuditService->>DB: INSERT single record
 end
```
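
The asynchronous branch of the diagram could be approximated with an `asyncio` queue, as in this sketch. `AsyncAuditWriter` and the `write_batch` coroutine are hypothetical; the point is that `log_event` only enqueues, a background task performs the batched write, and `shutdown` drains whatever is still pending.

```python
import asyncio


class AsyncAuditWriter:
    """Hands events to a background task so callers never wait on the DB."""

    def __init__(self, write_batch, max_batch: int = 500):
        self._write_batch = write_batch  # async callable taking a list of events
        self._max_batch = max_batch
        self._queue: asyncio.Queue = asyncio.Queue()
        self._task = None

    def start(self) -> None:
        """Start the background writer; call from inside a running event loop."""
        self._task = asyncio.create_task(self._run())

    def log_event(self, action: str, resource: str) -> None:
        """Enqueue an event without awaiting any I/O."""
        self._queue.put_nowait({"action": action, "resource": resource})

    async def _run(self) -> None:
        while True:
            batch = [await self._queue.get()]
            # Drain whatever else is already queued, up to the batch limit.
            while len(batch) < self._max_batch and not self._queue.empty():
                batch.append(self._queue.get_nowait())
            stop = None in batch  # None is the shutdown sentinel
            events = [e for e in batch if e is not None]
            if events:
                await self._write_batch(events)
            if stop:
                return

    async def shutdown(self) -> None:
        """Flush pending events, then stop the background task."""
        self._queue.put_nowait(None)
        await self._task
```

Awaiting `shutdown()` from the application's shutdown hook is what would cover the "flush pending events" requirement; events logged after the sentinel is queued are not guaranteed to be written in this sketch.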

### Configuration Hierarchy

```mermaid
flowchart TD
 A[AUDIT_TRAIL_ENABLED] -->|false| Z[No logging]
 A -->|true| B[Check AUDIT_TRAIL_LEVEL]
 B --> C{Operation Type}
 C -->|matches level| D[Check Exclusions]
 C -->|doesn't match| Z
 D -->|excluded| Z
 D -->|not excluded| E[Create Audit Event]
 E --> F{AUDIT_TRAIL_ASYNC?}
 F -->|true| G[Add to Buffer]
 F -->|false| H[Write to DB]
```
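
Read as code, the flowchart is a short chain of early exits. The sketch below takes the outcome of each check as a boolean so it stays independent of how level matching and exclusions are implemented; `AuditRoute` and `route_audit_event` are hypothetical names.

```python
from enum import Enum


class AuditRoute(Enum):
    SKIP = "no logging"
    BUFFER = "add to buffer"
    DIRECT = "write to db"


def route_audit_event(
    enabled: bool,
    level_matches: bool,
    excluded: bool,
    async_enabled: bool,
) -> AuditRoute:
    """Walk the configuration hierarchy from kill switch to write path."""
    if not enabled:  # AUDIT_TRAIL_ENABLED=false
        return AuditRoute.SKIP
    if not level_matches:  # operation type not covered by AUDIT_TRAIL_LEVEL
        return AuditRoute.SKIP
    if excluded:  # action or resource listed in an exclusion setting
        return AuditRoute.SKIP
    return AuditRoute.BUFFER if async_enabled else AuditRoute.DIRECT
```

Ordering the checks from cheapest to most specific means a disabled audit trail costs a single boolean test per request.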

---

## 📋 Implementation Tasks

### Phase 1: Configurable Logging Level
- [ ] Add `AUDIT_TRAIL_LEVEL` setting with enum values
- [ ] Implement level checking in audit service
- [ ] Add level to audit record for filtering
- [ ] Update existing callers to include operation type

### Phase 2: Async/Buffered Writes
- [ ] Add `AUDIT_TRAIL_ASYNC` setting
- [ ] Add `AUDIT_TRAIL_BUFFER_ENABLED` setting
- [ ] Add `AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL` setting
- [ ] Add `AUDIT_TRAIL_BUFFER_MAX_SIZE` setting
- [ ] Implement `AuditBufferService` (similar to `MetricsBufferService`)
- [ ] Add graceful shutdown to flush pending events

### Phase 3: Retention Policy
- [ ] Add `AUDIT_TRAIL_RETENTION_DAYS` setting
- [ ] Add `AUDIT_TRAIL_CLEANUP_ENABLED` setting
- [ ] Add `AUDIT_TRAIL_CLEANUP_BATCH_SIZE` setting
- [ ] Implement background cleanup task
- [ ] Add cleanup metrics (records deleted, duration)

### Phase 4: Exclusions
- [ ] Add `AUDIT_TRAIL_EXCLUDE_ACTIONS` setting
- [ ] Add `AUDIT_TRAIL_EXCLUDE_RESOURCES` setting
- [ ] Implement exclusion checking in audit service

### Phase 5: Sampling (Optional)
- [ ] Add `AUDIT_TRAIL_SAMPLE_RATE` setting
- [ ] Add `AUDIT_TRAIL_SAMPLE_READS` setting
- [ ] Implement probabilistic sampling for high-volume events
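
Probabilistic sampling could be sketched as below. The source does not define the two settings, so this assumes `AUDIT_TRAIL_SAMPLE_RATE` is a fraction between 0.0 and 1.0 and that `AUDIT_TRAIL_SAMPLE_READS` is a boolean that turns sampling on for READ operations only; `should_record` is a hypothetical helper.

```python
import random
from typing import Optional


def should_record(
    operation: str,
    sample_rate: float,
    sample_reads: bool = True,
    rng: Optional[random.Random] = None,
) -> bool:
    """Keep every non-read event; keep reads with probability sample_rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    if operation.upper() != "READ" or not sample_reads:
        return True  # writes are never sampled away in this sketch
    source = rng or random
    # random() returns a float in [0.0, 1.0), so a rate of 0.0 drops every
    # read and a rate of 1.0 keeps every read.
    return source.random() < sample_rate
```

Passing a seeded `random.Random` makes the behavior reproducible in unit tests, while production code would rely on the module-level generator.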

### Phase 6: Documentation
- [ ] Create `docs/docs/manage/audit-trails.md`
- [ ] Document all configuration options
- [ ] Add compliance guide (SOC2, HIPAA, GDPR)
- [ ] Add troubleshooting section

### Phase 7: Testing
- [ ] Unit tests for level filtering
- [ ] Unit tests for buffer service
- [ ] Unit tests for cleanup task
- [ ] Integration tests for end-to-end flow
- [ ] Performance tests comparing sync vs async

---

## ⚙️ Configuration Example

```bash
# Recommended Production Configuration
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=writes_only # Skip reads (~90% fewer audit writes)
AUDIT_TRAIL_ASYNC=true # Non-blocking writes
AUDIT_TRAIL_BUFFER_ENABLED=true # Batch writes
AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL=30 # Flush every 30s
AUDIT_TRAIL_BUFFER_MAX_SIZE=500 # Or when 500 events buffered
AUDIT_TRAIL_RETENTION_DAYS=90 # Keep 3 months
AUDIT_TRAIL_CLEANUP_ENABLED=true # Auto-cleanup old records
AUDIT_TRAIL_CLEANUP_BATCH_SIZE=10000 # Delete in batches

# Load Testing / Development
AUDIT_TRAIL_ENABLED=false

# High-Security Compliance
AUDIT_TRAIL_ENABLED=true
AUDIT_TRAIL_LEVEL=all # Log everything
AUDIT_TRAIL_RETENTION_DAYS=365 # Keep 1 year
```
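
For illustration, these variables could be read into a typed settings object as sketched below, using only the standard library rather than whatever settings mechanism the project actually uses. The fallbacks for `AUDIT_TRAIL_LEVEL`, `AUDIT_TRAIL_ASYNC`, and the retention and cleanup settings follow the defaults stated in the user stories; the fallbacks for `AUDIT_TRAIL_ENABLED` and the three buffer settings are assumptions taken from the recommended profile, and all Python names are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_LEVELS = {"all", "writes_only", "mutations_only", "deletes_only", "failures_only"}


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_positive_int(env: Mapping[str, str], key: str, default: int) -> int:
    value = int(env.get(key, default))
    if value <= 0:
        raise ValueError(f"{key} must be a positive integer")
    return value


@dataclass(frozen=True)
class AuditTrailSettings:
    enabled: bool
    level: str
    async_writes: bool
    buffer_enabled: bool
    buffer_flush_interval: int  # seconds
    buffer_max_size: int  # events
    retention_days: int
    cleanup_enabled: bool
    cleanup_batch_size: int


def load_audit_settings(env: Mapping[str, str] = os.environ) -> AuditTrailSettings:
    """Build settings from environment variables, validating the level."""
    level = env.get("AUDIT_TRAIL_LEVEL", "writes_only").strip().lower()
    if level not in _LEVELS:
        raise ValueError(f"Unknown AUDIT_TRAIL_LEVEL: {level!r}")
    return AuditTrailSettings(
        enabled=_as_bool(env.get("AUDIT_TRAIL_ENABLED", "true")),  # assumed default
        level=level,
        async_writes=_as_bool(env.get("AUDIT_TRAIL_ASYNC", "true")),
        buffer_enabled=_as_bool(env.get("AUDIT_TRAIL_BUFFER_ENABLED", "true")),  # assumed
        buffer_flush_interval=_as_positive_int(env, "AUDIT_TRAIL_BUFFER_FLUSH_INTERVAL", 30),
        buffer_max_size=_as_positive_int(env, "AUDIT_TRAIL_BUFFER_MAX_SIZE", 500),
        retention_days=_as_positive_int(env, "AUDIT_TRAIL_RETENTION_DAYS", 90),
        cleanup_enabled=_as_bool(env.get("AUDIT_TRAIL_CLEANUP_ENABLED", "true")),
        cleanup_batch_size=_as_positive_int(env, "AUDIT_TRAIL_CLEANUP_BATCH_SIZE", 10000),
    )
```

Failing fast on an unknown level or a non-positive number surfaces typos at startup instead of silently falling back to a different logging behavior.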

---

## ✅ Success Criteria

- [ ] `AUDIT_TRAIL_LEVEL=writes_only` reduces writes by ~90%
- [ ] Async writes remove audit write latency from the request path
- [ ] Buffered writes reduce DB round-trips by ~95%
- [ ] Retention cleanup prevents unbounded table growth
- [ ] Exclusions allow filtering noisy events
- [ ] Load test with 2000 users passes without DB exhaustion
- [ ] All configurations documented in `.env.example`
- [ ] Compliance documentation created

---

## 🏁 Definition of Done

- [ ] Logging level configuration implemented
- [ ] Async/buffered writes implemented
- [ ] Retention cleanup implemented
- [ ] Exclusions implemented
- [ ] Unit tests written and passing
- [ ] Integration tests written and passing
- [ ] Performance tests show improvement
- [ ] Code passes `make verify`
- [ ] Documentation created
- [ ] Settings documented in `.env.example`
- [ ] PR reviewed and approved

---

## 📝 Additional Notes

### Implementation Priority

| Enhancement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| Logging Level (`writes_only`) | Low | High | P0 |
| Async Writes | Low | High | P0 |
| Exclusions | Low | Medium | P1 |
| Buffered Writes | Medium | High | P1 |
| Retention Policy | Medium | Medium | P1 |
| Sampling | Low | Medium | P2 |

### Performance Expectations

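| Configuration | Reduction in Audit Rows Written | Expected Latency / DB Impact |
|---------------|---------------------------------|------------------------------|
| `AUDIT_TRAIL_LEVEL=writes_only` | ~90% | None |
| `AUDIT_TRAIL_ASYNC=true` | 0% | ~50 ms lower average request latency |
| `AUDIT_TRAIL_BUFFER_ENABLED=true` | 0% | ~95% fewer DB round-trips |
| Combined | ~90% | Near-zero audit overhead on requests |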

---

## 🔗 Related Issues

- Issue #1743 - Added `AUDIT_TRAIL_ENABLED` kill switch
- Load testing results documentation
- Compliance requirements documentation
