[EPIC][OBSERVABILITY]: Internal observability system - Performance monitoring #1401

@crivetimihai

📊 Epic: Internal Observability System - Performance Monitoring & Trace Analytics

Goal

Implement a comprehensive, self-contained observability system that provides performance monitoring, error tracking, and trace analytics for MCP Gateway operations without requiring external platforms. Enable operators to monitor tool invocations, prompt rendering, resource fetching, and HTTP requests through an integrated Admin UI with interactive visualizations, detailed metrics, and trace exploration capabilities.

Why Now?

As ContextForge deployments scale to production environments with hundreds of tools, prompts, and resources serving multiple clients, operators need visibility into:

  1. Performance Bottlenecks: Which tools/prompts/resources are slow? What's the p99 latency?
  2. Error Tracking: Which operations are failing? What's the error rate trend?
  3. Usage Patterns: Which tools are most frequently invoked? Which prompts are rarely used?
  4. Request Tracing: Complete visibility into HTTP request/response cycles with timing breakdown
  5. Root Cause Analysis: Detailed trace inspection with spans, events, and attributes
  6. Self-Hosted Solution: No dependency on external observability platforms (Phoenix, Jaeger, etc.)

Current limitations create operational blind spots:

  • No visibility into tool invocation performance
  • No error rate tracking for prompts and resources
  • No request tracing for debugging
  • Dependency on external observability platforms increases infrastructure complexity
  • No historical performance data for trend analysis

By implementing an internal observability system with database-backed storage and Admin UI visualizations, operators gain complete performance visibility while maintaining infrastructure simplicity.


📖 User Stories

US-1: SRE - Monitor Tool Performance Metrics

As an SRE
I want to view performance metrics for all MCP tools
So that I can identify slow tools and optimize critical paths

Acceptance Criteria:

Given observability is enabled (OBSERVABILITY_ENABLED=true)
And tools have been invoked multiple times

When I navigate to /admin/observability/tools
Then I should see a dashboard with:
  - Summary cards showing:
    - Overall health (success rate %)
    - Most used tool (by invocation count)
    - Slowest tool (by p99 latency)
    - Most error-prone tool (by error rate %)
  
  - Performance charts:
    - Tool usage bar chart (invocation counts)
    - Average latency bar chart (milliseconds)
    - Error rate chart (percentage with color coding)
    - Top N error-prone tools chart
  
  - Detailed metrics table with columns:
    - Tool Name
    - Invocation Count
    - Latency Percentiles (p50, p90, p95, p99)
    - Error Rate (%)
    - Last Used timestamp

When I filter by "Last 1 hour"
Then only metrics from the past hour should display

When I select "Top 50" limit
Then the table should show at most 50 tools

When I enable auto-refresh
Then the dashboard should update every 60 seconds

Technical Requirements:

  • Query traces table with aggregations (COUNT, AVG, percentile calculations)
  • Color-coded health indicators: green (<5% errors), yellow (5-20%), red (>20%)
  • Chart.js for interactive visualizations
  • HTMX for auto-refresh without page reload
  • SQL performance indexes for fast queries
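The color-coded health thresholds above (also listed under Metrics Calculation in the notes) reduce to a tiny helper. A minimal sketch; the function name and string return values are illustrative, not part of the spec:

```python
def health_color(error_rate_pct: float) -> str:
    """Map an error-rate percentage to a dashboard health color.

    Thresholds follow this epic: green (<5%), yellow (5-20%), red (>20%).
    """
    if error_rate_pct < 5.0:
        return "green"
    if error_rate_pct <= 20.0:
        return "yellow"
    return "red"
```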
US-2: Developer - View Detailed Request Traces

As a Developer
I want to view detailed traces for individual HTTP requests
So that I can debug performance issues and understand request flow

Acceptance Criteria:

Given HTTP request tracing is enabled (OBSERVABILITY_TRACE_HTTP_REQUESTS=true)
And I made a POST request to /mcp/tools/invoke

When I navigate to /admin/observability/metrics
And I click on a trace in the "Recent Traces" list

Then I should see a trace detail page showing:
  - Trace metadata:
    - Trace ID
    - Start time and end time
    - Total duration (milliseconds)
    - Status (success/error)
  
  - Gantt chart timeline:
    - Visual timeline of all spans
    - Span hierarchy (parent-child relationships)
    - Duration bars with relative positioning
    - Critical path highlighting
  
  - Flame graph:
    - Hierarchical visualization of nested operations
    - Color-coded by operation type
    - Interactive zoom and search
    - Time-proportional rectangles
  
  - Span details table:
    - Span name and operation
    - Start time and duration
    - Status and attributes
    - Events (if any)

When I click on a span in the Gantt chart
Then the span details should scroll into view and highlight

When I search for "database" in the flame graph
Then matching spans should be highlighted

Technical Requirements:

  • Store traces with spans, events, and attributes in SQLite/PostgreSQL
  • Generate trace IDs and span IDs (UUIDs)
  • Implement Gantt chart with D3.js or vanilla JS
  • Implement flame graph with hierarchical rendering
  • Support span parent-child relationships
  • Calculate critical path (longest dependency chain)
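The critical-path requirement (longest dependency chain) can be sketched as a walk over the span tree. A simplified illustration, assuming each span dict carries `span_id`, `parent_span_id`, and `duration_ms`; a production version would also account for start/end times and overlapping children:

```python
from collections import defaultdict

def critical_path(spans: list[dict]) -> list[str]:
    """Return span_ids on the root-to-leaf chain with the largest summed duration."""
    children = defaultdict(list)
    by_id = {}
    roots = []
    for s in spans:
        by_id[s["span_id"]] = s
        if s["parent_span_id"] is None:
            roots.append(s["span_id"])
        else:
            children[s["parent_span_id"]].append(s["span_id"])

    def longest(sid: str) -> tuple[float, list[str]]:
        # Recurse into children and keep the costliest sub-chain.
        best_cost, best_path = 0.0, []
        for child in children[sid]:
            cost, path = longest(child)
            if cost > best_cost:
                best_cost, best_path = cost, path
        return by_id[sid]["duration_ms"] + best_cost, [sid] + best_path

    return max((longest(r) for r in roots), default=(0.0, []))[1]
```

The Gantt chart would then mark every span on the returned list with the red critical-path border.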
US-3: Platform Admin - Configure Observability Settings

As a Platform Administrator
I want to configure observability retention, sampling, and exclusions
So that I can control storage costs and trace noise

Acceptance Criteria:

Given I have access to .env configuration

When I set:
  OBSERVABILITY_ENABLED=true
  OBSERVABILITY_TRACE_RETENTION_DAYS=7
  OBSERVABILITY_MAX_TRACES=100000
  OBSERVABILITY_SAMPLE_RATE=0.5
  OBSERVABILITY_EXCLUDE_PATHS=/health,/metrics,/static/.*

Then the observability system should:
  - Automatically delete traces older than 7 days
  - Limit total traces to 100,000 (FIFO deletion when exceeded)
  - Sample 50% of requests (randomly trace half of all requests)
  - Exclude /health, /metrics, and /static/* paths from tracing

When I restart MCP Gateway
Then the new configuration should take effect

When I check the database
Then I should see:
  - No traces older than 7 days
  - Total trace count ≤ 100,000
  - No traces for /health, /metrics, or /static/* paths

Technical Requirements:

  • Automatic cleanup job for trace retention
  • FIFO trace deletion when max limit exceeded
  • Probabilistic sampling based on sample rate
  • Regex-based path exclusion matching
  • Validation: retention 1-365 days, sample rate 0.0-1.0
  • Environment variable parsing and defaults
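Parsing and validating these settings might look roughly like the following. The environment variable names match the epic; the class, its attribute names, and the `is_excluded` helper are illustrative:

```python
import os
import re

class ObservabilityConfig:
    """Parse and validate observability settings from environment variables."""

    def __init__(self, env=os.environ):
        self.enabled = env.get("OBSERVABILITY_ENABLED", "false").lower() == "true"
        self.retention_days = int(env.get("OBSERVABILITY_TRACE_RETENTION_DAYS", "7"))
        self.max_traces = int(env.get("OBSERVABILITY_MAX_TRACES", "100000"))
        self.sample_rate = float(env.get("OBSERVABILITY_SAMPLE_RATE", "1.0"))
        # Comma-separated regex patterns, pre-compiled for the middleware.
        raw = env.get("OBSERVABILITY_EXCLUDE_PATHS", "/health,/metrics,/static/.*")
        self.exclude_paths = [re.compile(p) for p in raw.split(",") if p]

        if not 1 <= self.retention_days <= 365:
            raise ValueError("OBSERVABILITY_TRACE_RETENTION_DAYS must be 1-365")
        if not 0.0 <= self.sample_rate <= 1.0:
            raise ValueError("OBSERVABILITY_SAMPLE_RATE must be 0.0-1.0")

    def is_excluded(self, path: str) -> bool:
        """True if the request path matches any exclusion pattern exactly."""
        return any(p.fullmatch(path) for p in self.exclude_paths)
```

`fullmatch` keeps `/health` from also swallowing `/healthz`, which is why the default list spells both out.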
US-4: SRE - Monitor Prompt Rendering Performance

As an SRE
I want to view performance metrics for MCP prompt rendering
So that I can identify slow prompts and optimize template rendering

Acceptance Criteria:

Given prompts have been rendered multiple times

When I navigate to /admin/observability/prompts
Then I should see:
  - Summary cards:
    - Overall rendering health (success rate)
    - Most used prompt
    - Slowest prompt (p99 latency)
    - Most error-prone prompt
  
  - Charts:
    - Prompt render frequency (usage distribution)
    - Average latency per prompt
    - Error rate percentage
    - Top N error-prone prompts
  
  - Metrics table:
    - Prompt Name
    - Render Count
    - Latency Percentiles (p50, p90, p95, p99)
    - Error Rate (%)
    - Last Rendered timestamp

When I filter by "Last 24 hours"
Then only recent prompt metrics should display

When a prompt has >20% error rate
Then it should be highlighted in red

Technical Requirements:

  • Track prompt_get_prompt operations via instrumentation
  • Store prompt name, duration, status in traces/spans
  • Aggregate metrics with percentile calculations
  • Color-coded health indicators
  • Auto-refresh support
US-5: Developer - Monitor Resource Fetch Performance

As a Developer
I want to view performance metrics for MCP resource fetching
So that I can optimize resource loading and identify slow resources

Acceptance Criteria:

Given resources have been fetched multiple times

When I navigate to /admin/observability/resources
Then I should see:
  - Summary cards:
    - Overall fetch health
    - Most used resource (by fetch count)
    - Slowest resource (p99 latency)
    - Most error-prone resource
  
  - Charts:
    - Resource fetch frequency
    - Average fetch latency
    - Error rate distribution
    - Top N error-prone resources
  
  - Metrics table:
    - Resource URI
    - Fetch Count
    - Latency Percentiles (p50, p90, p95, p99)
    - Error Rate (%)
    - Last Fetched timestamp

When I click on a resource URI
Then I should see detailed traces for that resource

When I filter by time range "Last 7 days"
Then metrics should aggregate data from the past week

Technical Requirements:

  • Track resource_read operations via instrumentation
  • Store resource URI, duration, status in spans
  • Support filtering and aggregation
  • Link to detailed trace views
US-6: Operations Engineer - Analyze Error Trends

As an Operations Engineer
I want to identify error-prone operations and view error trends
So that I can proactively fix reliability issues

Acceptance Criteria:

Given multiple operations have failed

When I navigate to /admin/observability/metrics
Then I should see:
  - Overall error rate percentage
  - List of top N error-prone operations:
    - Operation name
    - Error count
    - Error rate percentage
    - Example error message
  
  - Recent failed traces:
    - Trace ID (clickable)
    - Operation name
    - Timestamp
    - Error message
    - Duration

When I click on a failed trace
Then I should see:
  - Full trace details with all spans
  - Error status highlighted
  - Error message and stack trace (if available)
  - Span attributes showing error context

When an operation has >50% error rate
Then it should be highlighted with a critical alert color

Technical Requirements:

  • Query traces with status = "error"
  • Aggregate error counts per operation
  • Calculate error rate percentage
  • Display error messages and attributes
  • Highlight critical error rates (>50%)
US-7: Developer - Visualize Request Flow with Flame Graphs

As a Developer
I want to visualize request execution flow using flame graphs
So that I can identify nested operation bottlenecks

Acceptance Criteria:

Given a trace has nested spans (parent-child hierarchy)

When I navigate to a trace detail page
And I view the flame graph section

Then I should see:
  - Hierarchical flame graph with:
    - Root span at the bottom
    - Child spans stacked above parents
    - Width proportional to duration
    - Color-coded by operation type
  
  - Interactive features:
    - Hover to see span details (name, duration)
    - Click to zoom into span
    - Search box to find spans by name
    - Reset zoom button
  
  - Legend showing:
    - Color mapping for operation types
    - Total trace duration
    - Number of spans

When I search for "database"
Then all spans containing "database" should be highlighted

When I click on a span
Then the view should zoom to show that span and its children

When I click "Reset"
Then the view should zoom out to show the full trace

Technical Requirements:

  • Render flame graph using SVG or Canvas
  • Calculate hierarchical layout (parent-child positioning)
  • Proportional width based on span duration
  • Interactive zoom and pan
  • Search with highlighting
  • Color palette for operation types
US-8: SRE - View Request Timeline with Gantt Charts

As an SRE
I want to view request execution timeline using Gantt charts
So that I can understand operation sequencing and parallelism

Acceptance Criteria:

Given a trace has multiple spans with start/end times

When I view the trace detail page
And I view the Gantt chart section

Then I should see:
  - Timeline with:
    - Time scale (x-axis) from trace start to end
    - Span rows (y-axis) showing each operation
    - Duration bars positioned by start time
    - Color-coded by span status (success/error)
  
  - Critical path highlighting:
    - Longest dependency chain highlighted
    - Critical spans marked with red border
  
  - Interactive features:
    - Hover to see span details
    - Click to view span attributes
    - Zoom and pan timeline
    - Expand/collapse nested spans

When I hover over a span bar
Then I should see a tooltip with:
  - Span name
  - Start time (relative to trace start)
  - Duration (milliseconds)
  - Status

When I click on a span bar
Then the span details table should scroll to that span

Technical Requirements:

  • Render Gantt chart using SVG or Canvas
  • Calculate relative positioning (start time offset)
  • Proportional bar width based on duration
  • Critical path calculation (longest chain)
  • Interactive tooltip and click handlers
  • Expand/collapse for nested spans
US-9: Platform Admin - Auto-Cleanup Old Traces

As a Platform Administrator
I want automatic cleanup of old traces based on retention policy
So that database storage doesn't grow unbounded

Acceptance Criteria:

Given OBSERVABILITY_TRACE_RETENTION_DAYS=7
And OBSERVABILITY_MAX_TRACES=100000

When the cleanup job runs (hourly or daily)
Then it should:
  - Delete traces older than 7 days
  - Delete oldest traces if total count > 100,000
  - Log cleanup actions (traces deleted, retention applied)

When I query the traces table
Then I should see:
  - No traces with created_at > 7 days ago
  - Total trace count ≤ 100,000

When the database has 150,000 traces
Then the cleanup job should:
  - Delete the oldest 50,000 traces (FIFO)
  - Keep the newest 100,000 traces

Technical Requirements:

  • Scheduled cleanup job (APScheduler or cron)
  • DELETE query with timestamp filter
  • FIFO deletion (ORDER BY created_at ASC LIMIT N)
  • Logging with trace count and retention stats
  • Configurable cleanup interval
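The two retention rules boil down to two DELETE statements. A sketch against SQLite; table and column names follow the schema in the Architecture section below, while the function itself is illustrative:

```python
import sqlite3
import time

def cleanup_traces(conn: sqlite3.Connection, retention_days: int, max_traces: int) -> int:
    """Apply retention policy: drop expired traces, then FIFO-trim to max_traces.

    Returns the number of traces deleted. Assumes start_time is a Unix
    epoch (REAL), as in this epic's schema sketch.
    """
    cutoff = time.time() - retention_days * 86400
    cur = conn.execute(
        "DELETE FROM observability_traces WHERE start_time < ?", (cutoff,)
    )
    deleted = cur.rowcount
    # FIFO trim: keep only the newest max_traces rows.
    cur = conn.execute(
        """DELETE FROM observability_traces WHERE id NOT IN (
               SELECT id FROM observability_traces
               ORDER BY created_at DESC, id DESC LIMIT ?)""",
        (max_traces,),
    )
    deleted += cur.rowcount
    conn.commit()
    return deleted
```

The returned count feeds the "log cleanup actions" requirement; in production the spans/attributes/events rows would be deleted alongside (or via ON DELETE CASCADE).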
US-10: Developer - SQLAlchemy Query Instrumentation

As a Developer
I want SQLAlchemy queries to be automatically traced
So that I can see database operation performance in traces

Acceptance Criteria:

Given SQLAlchemy instrumentation is enabled

When I invoke a tool that queries the database
Then the trace should include spans for:
  - SQL query execution
  - Query duration (milliseconds)
  - Query text (sanitized)
  - Table name
  - Row count (if applicable)

When I view the trace detail
Then I should see:
  - Span name: "db.query" or "db.execute"
  - Attributes:
    - db.statement: SQL query text
    - db.system: "sqlite" or "postgresql"
    - db.operation: "SELECT" or "INSERT" or "UPDATE"
  - Duration showing query execution time

When a query takes >1000ms
Then the span should be highlighted as slow

Technical Requirements:

  • SQLAlchemy event listeners (before_cursor_execute, after_cursor_execute)
  • Create spans for queries
  • Capture query text, parameters, duration
  • Sanitize sensitive data (passwords, tokens)
  • Set span attributes with db.* namespace
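Sanitizing sensitive data before storing `db.statement` might be a small redaction pass over the query text. The column-name pattern below is illustrative and deliberately conservative; a real deployment would tune the list and also scrub bound parameters:

```python
import re

# Redact values assigned to sensitive-looking columns. Pattern is a sketch,
# not exhaustive: matches quoted literals or bare tokens after "col =".
_SENSITIVE_COLS = re.compile(
    r"(\b(?:password|token|secret|api_key)\b\s*=\s*)('[^']*'|\S+)",
    re.IGNORECASE,
)

def sanitize_statement(sql: str) -> str:
    """Return SQL text safe to store as the db.statement span attribute."""
    return _SENSITIVE_COLS.sub(r"\1'[REDACTED]'", sql)
```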

🏗 Architecture

Component Structure

Observability System
├── Middleware
│   ├── ObservabilityMiddleware (HTTP request tracing)
│   └── AuthMiddleware (user context propagation)
│
├── Instrumentation
│   ├── SQLAlchemy (database query tracing)
│   └── Custom decorators (tool/prompt/resource tracing)
│
├── Storage Layer
│   ├── Traces table (trace metadata)
│   ├── Spans table (operation details)
│   ├── Span Events table (log events)
│   └── Span Attributes table (key-value pairs)
│
├── Service Layer
│   ├── ObservabilityService (trace CRUD, metrics aggregation)
│   └── Cleanup service (retention enforcement)
│
├── Router Layer
│   └── ObservabilityRouter (/admin/observability/*)
│
└── UI Layer
    ├── Metrics dashboard (summary cards, charts)
    ├── Tools dashboard (tool performance)
    ├── Prompts dashboard (prompt performance)
    ├── Resources dashboard (resource performance)
    ├── Trace list (recent traces)
    ├── Trace detail (Gantt chart, flame graph, spans)
    └── Interactive visualizations (Chart.js, custom SVG)

Database Schema

-- Indexes are created separately so the DDL works on both SQLite and
-- PostgreSQL (inline INDEX clauses are MySQL syntax), and index names
-- are prefixed per table because they share one namespace.

CREATE TABLE observability_traces (
    id INTEGER PRIMARY KEY,
    trace_id TEXT UNIQUE NOT NULL,  -- UNIQUE already provides an index
    name TEXT NOT NULL,
    start_time REAL NOT NULL,
    end_time REAL,
    duration_ms REAL,
    status TEXT,  -- success|error
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_traces_start_time ON observability_traces (start_time);
CREATE INDEX idx_traces_status ON observability_traces (status);

CREATE TABLE observability_spans (
    id INTEGER PRIMARY KEY,
    trace_id TEXT NOT NULL,
    span_id TEXT UNIQUE NOT NULL,  -- UNIQUE already provides an index
    parent_span_id TEXT,
    name TEXT NOT NULL,
    start_time REAL NOT NULL,
    end_time REAL,
    duration_ms REAL,
    status TEXT,
    operation TEXT,  -- tool_invoke|prompt_render|resource_fetch|http_request
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (trace_id) REFERENCES observability_traces (trace_id)
);
CREATE INDEX idx_spans_trace_id ON observability_spans (trace_id);
CREATE INDEX idx_spans_operation ON observability_spans (operation);

CREATE TABLE observability_span_attributes (
    id INTEGER PRIMARY KEY,
    span_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT,
    FOREIGN KEY (span_id) REFERENCES observability_spans (span_id)
);
CREATE INDEX idx_attrs_span_id ON observability_span_attributes (span_id);

CREATE TABLE observability_span_events (
    id INTEGER PRIMARY KEY,
    span_id TEXT NOT NULL,
    timestamp REAL NOT NULL,
    name TEXT NOT NULL,
    attributes TEXT,  -- JSON
    FOREIGN KEY (span_id) REFERENCES observability_spans (span_id)
);
CREATE INDEX idx_events_span_id ON observability_span_events (span_id);

Trace Flow

  1. HTTP Request → ObservabilityMiddleware creates trace + root span
  2. Tool Invocation → ToolService creates child span (operation: tool_invoke)
  3. Database Query → SQLAlchemy instrumentation creates child span (operation: db.query)
  4. Response → Middleware ends root span, calculates duration, stores trace

Metrics Aggregation

-- Tool metrics (p50/p90/p95/p99 latencies). PERCENTILE_CONT is
-- PostgreSQL-only; on SQLite, compute percentiles in the application layer.
SELECT 
    name,
    COUNT(*) as invocation_count,
    AVG(duration_ms) as avg_latency,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY duration_ms) as p90,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99,
    SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as error_rate
FROM observability_spans
WHERE operation = 'tool_invoke'
    AND start_time > <time_filter>
GROUP BY name
ORDER BY invocation_count DESC;
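For SQLite, which has no PERCENTILE_CONT, the percentiles can be computed in Python over the fetched durations. A linear-interpolation sketch matching PERCENTILE_CONT semantics:

```python
def percentile(durations: list[float], p: float) -> float:
    """PERCENTILE_CONT-style percentile (linear interpolation), p in [0, 1].

    Application-layer fallback for SQLite, which lacks a built-in
    percentile aggregate.
    """
    if not durations:
        raise ValueError("percentile of empty input")
    xs = sorted(durations)
    k = (len(xs) - 1) * p          # fractional rank
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)
```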

📋 Implementation Tasks

  • Database Schema & Migrations

    • Create observability_traces table
    • Create observability_spans table
    • Create observability_span_attributes table
    • Create observability_span_events table
    • Add performance indexes (trace_id, span_id, operation, start_time)
    • Add saved query views for common metrics
    • Alembic migration: a23a08d61eb0_add_observability_tables.py
    • Alembic migration: i3c4d5e6f7g8_add_observability_performance_indexes.py
    • Alembic migration: j4d5e6f7g8h9_add_observability_saved_queries.py
  • Configuration & Settings

    • Add OBSERVABILITY_ENABLED setting (default: false)
    • Add OBSERVABILITY_TRACE_HTTP_REQUESTS (default: true)
    • Add OBSERVABILITY_TRACE_RETENTION_DAYS (default: 7)
    • Add OBSERVABILITY_MAX_TRACES (default: 100000)
    • Add OBSERVABILITY_SAMPLE_RATE (default: 1.0)
    • Add OBSERVABILITY_EXCLUDE_PATHS (default: /health,/metrics,/static/.*)
    • Add OBSERVABILITY_METRICS_ENABLED (default: true)
    • Add OBSERVABILITY_EVENTS_ENABLED (default: true)
    • Update .env.example with all observability settings
    • Update config.py with validation and defaults
    • Update Helm chart values.yaml with observability config
  • Middleware Implementation

    • Create ObservabilityMiddleware for HTTP request tracing
    • Implement trace creation (trace_id, root span)
    • Implement span creation with start_time
    • Implement span ending with end_time, duration calculation
    • Implement sampling logic based on OBSERVABILITY_SAMPLE_RATE
    • Implement path exclusion regex matching
    • Store request method, path, status_code as span attributes
    • Handle exceptions and mark spans as error status
    • Integrate with FastAPI middleware stack
  • SQLAlchemy Instrumentation

    • Create SQLAlchemy event listeners (before/after execute)
    • Create spans for database queries
    • Capture query text, operation (SELECT/INSERT/UPDATE)
    • Capture query duration
    • Set span attributes (db.statement, db.system, db.operation)
    • Sanitize sensitive data in queries
    • Handle query errors and mark spans as failed
  • Service Layer - ObservabilityService

    • Implement create_trace(name, start_time) → trace_id
    • Implement create_span(trace_id, name, operation, parent_span_id)
    • Implement end_span(span_id, end_time, status, attributes)
    • Implement add_span_event(span_id, name, timestamp, attributes)
    • Implement get_traces(limit, offset, time_filter)
    • Implement get_trace_by_id(trace_id) with spans
    • Implement get_tool_metrics(time_filter, limit)
    • Implement get_prompt_metrics(time_filter, limit)
    • Implement get_resource_metrics(time_filter, limit)
    • Implement cleanup_old_traces(retention_days, max_traces)
    • Implement percentile calculations (p50, p90, p95, p99)
  • Router Layer - ObservabilityRouter

    • Create /admin/observability route
    • Create /admin/observability/metrics route (summary dashboard)
    • Create /admin/observability/tools route (tool metrics)
    • Create /admin/observability/prompts route (prompt metrics)
    • Create /admin/observability/resources route (resource metrics)
    • Create /admin/observability/traces route (trace list)
    • Create /admin/observability/traces/{trace_id} route (trace detail)
    • Implement time filter query params (1h, 24h, 7d, 30d)
    • Implement limit query param (10, 20, 50, 100)
    • Implement auto-refresh query param
  • Templates - Dashboard UI

    • Create observability_metrics.html (summary dashboard)
    • Create observability_tools.html (tool metrics dashboard)
    • Create observability_prompts.html (prompt metrics dashboard)
    • Create observability_resources.html (resource metrics dashboard)
    • Create observability_traces_list.html (recent traces)
    • Create observability_trace_detail.html (trace details)
    • Create observability_partial.html (shared components)
    • Create observability_stats.html (summary cards)
    • Add summary cards (health, most used, slowest, error-prone)
    • Add time filter dropdown (1h, 24h, 7d, 30d)
    • Add limit dropdown (10, 20, 50, 100)
    • Add auto-refresh toggle (60s interval)
    • Add HTMX for dynamic updates
  • Charts & Visualizations

    • Implement Chart.js bar charts (usage, latency, error rate)
    • Implement color-coded health indicators
    • Implement Gantt chart for trace timeline
    • Implement flame graph for trace hierarchy
    • Create flame-graph.js with interactive features
    • Create flame-graph.css for styling
    • Create gantt-chart.js with timeline rendering
    • Create gantt-chart.css for styling
    • Implement critical path highlighting
    • Implement zoom and pan for flame graphs
    • Implement search in flame graphs
    • Implement tooltips for span details
  • Tool/Prompt/Resource Instrumentation

    • Update ToolService.invoke_tool() to create spans
    • Update PromptService.get_prompt() to create spans
    • Update ResourceService.read_resource() to create spans
    • Set operation attribute (tool_invoke, prompt_render, resource_fetch)
    • Capture operation name, duration, status
    • Handle errors and mark spans as failed
  • Cleanup & Retention

    • Implement scheduled cleanup job (APScheduler)
    • Delete traces older than retention days
    • Delete oldest traces when max limit exceeded (FIFO)
    • Log cleanup actions (traces deleted, retention stats)
    • Run cleanup hourly or daily
  • Testing

    • Unit tests: ObservabilityService CRUD operations (10+ tests)
    • Unit tests: Trace creation, span creation, span ending (5+ tests)
    • Unit tests: Metrics aggregation (percentiles, error rates) (5+ tests)
    • Unit tests: Cleanup logic (retention, max traces) (3+ tests)
    • Integration tests: Middleware trace creation (5+ tests)
    • Integration tests: SQLAlchemy instrumentation (5+ tests)
    • Integration tests: Tool/prompt/resource tracing (10+ tests)
    • Integration tests: Router endpoints (10+ tests)
    • Playwright tests: Dashboard navigation (5+ tests)
    • Playwright tests: Time filter interaction (3+ tests)
    • Playwright tests: Trace detail view (3+ tests)
    • Playwright tests: Flame graph interaction (3+ tests)
    • Performance tests: Query 100K traces (<500ms)
    • Performance tests: Aggregate metrics with 1M spans (<1s)
    • Test coverage: 85%+ for observability code
  • Documentation

    • Create docs/docs/manage/observability/internal-observability.md
    • Document configuration options (.env variables)
    • Document Admin UI dashboards (tools, prompts, resources)
    • Document trace detail view (Gantt chart, flame graph)
    • Document metrics (p50, p90, p95, p99 latencies)
    • Document retention and cleanup policies
    • Document sampling configuration
    • Document path exclusion patterns
    • Add examples: enabling observability
    • Add examples: viewing tool metrics
    • Add examples: analyzing traces
    • Add screenshots of dashboards
    • Update README.md with observability features
    • Update CLAUDE.md with observability usage
  • Code Quality

    • Run make autoflake isort black pre-commit
    • Run make flake8 bandit interrogate pylint verify
    • Run make lint-web for CSS/JS validation
    • Ensure consistent naming conventions
    • Add docstrings to all public functions
    • Remove debug console.log statements
    • Validate HTML structure
    • Validate CSS (no unused styles)

⚙️ Configuration

Environment Variables

# Enable internal observability
OBSERVABILITY_ENABLED=false

# Automatically trace HTTP requests
OBSERVABILITY_TRACE_HTTP_REQUESTS=true

# Number of days to retain trace data
OBSERVABILITY_TRACE_RETENTION_DAYS=7

# Maximum number of traces to retain (prevents unbounded growth)
OBSERVABILITY_MAX_TRACES=100000

# Trace sampling rate (0.0-1.0) - 1.0 means trace everything, 0.1 means trace 10%
OBSERVABILITY_SAMPLE_RATE=1.0

# Paths to exclude from tracing (comma-separated regex patterns)
OBSERVABILITY_EXCLUDE_PATHS=/health,/healthz,/ready,/metrics,/static/.*

# Enable metrics collection
OBSERVABILITY_METRICS_ENABLED=true

# Enable event logging within spans
OBSERVABILITY_EVENTS_ENABLED=true

✅ Success Criteria

  • Trace Storage: Traces and spans stored in SQLite/PostgreSQL with proper indexes
  • HTTP Tracing: All HTTP requests automatically traced (excluding configured paths)
  • Metrics Dashboards: Tools, prompts, and resources have dedicated metrics dashboards
  • Performance Metrics: p50, p90, p95, p99 latency percentiles calculated and displayed
  • Error Tracking: Error rates and error-prone operations identified
  • Trace Visualization: Gantt charts and flame graphs for trace exploration
  • Interactive UI: Click-through from metrics to detailed traces
  • Auto-Refresh: Dashboards update every 60 seconds
  • Retention Policy: Automatic cleanup of old traces based on configuration
  • Sampling: Configurable trace sampling to reduce storage costs
  • Path Exclusion: Health checks and static assets excluded from tracing
  • Testing: 85%+ test coverage for observability code
  • Documentation: Comprehensive guide with examples and screenshots
  • Performance: Query 100K traces in <500ms, aggregate metrics in <1s

🏁 Definition of Done

  • Database tables created with Alembic migrations
  • Performance indexes added for fast queries
  • ObservabilityMiddleware implemented and integrated
  • SQLAlchemy instrumentation implemented
  • ObservabilityService with CRUD and metrics methods
  • ObservabilityRouter with all dashboard routes
  • Templates for metrics, tools, prompts, resources, traces
  • Gantt chart visualization implemented (gantt-chart.js)
  • Flame graph visualization implemented (flame-graph.js)
  • Summary cards with health indicators
  • Chart.js visualizations (usage, latency, error rate)
  • Time filter and limit controls functional
  • Auto-refresh toggle functional (60s interval)
  • Cleanup job for trace retention
  • Configuration options in .env.example
  • Config.py with validation and defaults
  • Unit tests for service layer (10+ tests)
  • Integration tests for middleware and instrumentation (15+ tests)
  • Playwright tests for UI dashboards (10+ tests)
  • Performance tests for queries and aggregations (3+ tests)
  • Documentation: internal-observability.md
  • Documentation: configuration.md updated
  • README.md updated with observability features
  • CLAUDE.md updated with usage instructions
  • Code passes make lint-web checks
  • Code passes make flake8 bandit interrogate pylint verify
  • Screenshots captured for documentation
  • Backward compatible: No breaking changes to existing code

📝 Additional Notes

🔹 Differences from OpenTelemetry Integration:

  • Storage: Local database (SQLite/PostgreSQL) vs. external systems (Phoenix, Jaeger)
  • Infrastructure: Self-contained vs. requires external observability platform
  • Complexity: Simpler setup vs. more configuration
  • Use Case: Development/testing vs. production at scale

🔹 Performance Considerations:

  • Indexes on trace_id, span_id, operation, start_time for fast queries
  • Sampling to reduce trace volume (10-50% in production)
  • Automatic cleanup to prevent unbounded storage growth
  • Efficient percentile calculations using database aggregations
  • Client-side Chart.js rendering to reduce server load

🔹 Trace Visualization Technologies:

  • Gantt Chart: Custom JavaScript with SVG rendering
  • Flame Graph: Hierarchical SVG visualization with zoom/pan
  • Charts: Chart.js for bar charts and line charts
  • HTMX: Dynamic updates without full page reload

🔹 Metrics Calculation:

  • Latency Percentiles: SQL PERCENTILE_CONT function (PostgreSQL) or custom calculation (SQLite)
  • Error Rate: (COUNT(status='error') / COUNT(*)) * 100
  • Health Status: Green (<5%), Yellow (5-20%), Red (>20% errors)

🔹 Future Enhancements:

  • Distributed Tracing: Trace propagation across multiple MCP Gateway instances
  • Alerting: Threshold-based alerts (error rate >10%, latency >1s)
  • Anomaly Detection: Machine learning for unusual performance patterns
  • Trace Comparison: Side-by-side comparison of two traces
  • Export to OpenTelemetry: Bridge to external systems when needed
  • Live Tail: Real-time trace streaming in Admin UI
  • Custom Dashboards: User-configurable dashboard widgets
  • SLO Tracking: Service Level Objective monitoring (p99 <500ms)

🔹 Security Considerations:

  • Sanitize SQL queries in traces (remove passwords, tokens)
  • Sanitize HTTP headers (remove Authorization, Cookie)
  • Role-based access control for observability endpoints
  • Trace retention limits to prevent storage exhaustion

🔹 Migration Notes:

  • Run Alembic migrations to create observability tables
  • No data migration required (new feature)
  • Observability disabled by default (opt-in via OBSERVABILITY_ENABLED=true)
  • No breaking changes to existing APIs
  • Backward compatible with all existing features

🔗 Related Issues

  • OpenTelemetry integration (external observability platforms)
  • Performance optimization and profiling
  • Admin UI enhancements
  • Database query optimization
  • SLO/SLA monitoring

Metadata

Labels
  • epic: Large feature spanning multiple issues
  • observability: Observability, logging, monitoring