[EPIC][OBSERVABILITY]: Internal observability system - Performance monitoring #1401

@crivetimihai

📊 Epic: Internal Observability System - Performance Monitoring & Trace Analytics

Goal

Implement a comprehensive, self-contained observability system that provides performance monitoring, error tracking, and trace analytics for MCP Gateway operations without requiring external platforms. Enable operators to monitor tool invocations, prompt rendering, resource fetching, and HTTP requests through an integrated Admin UI with interactive visualizations, detailed metrics, and trace exploration capabilities.

Why Now?

As ContextForge deployments scale to production environments with hundreds of tools, prompts, and resources serving multiple clients, operators need visibility into:

  1. Performance Bottlenecks: Which tools/prompts/resources are slow? What's the p99 latency?
  2. Error Tracking: Which operations are failing? What's the error rate trend?
  3. Usage Patterns: Which tools are most frequently invoked? Which prompts are rarely used?
  4. Request Tracing: Complete visibility into HTTP request/response cycles with timing breakdown
  5. Root Cause Analysis: Detailed trace inspection with spans, events, and attributes
  6. Self-Hosted Solution: No dependency on external observability platforms (Phoenix, Jaeger, etc.)

Current limitations create operational blind spots:

  • No visibility into tool invocation performance
  • No error rate tracking for prompts and resources
  • No request tracing for debugging
  • Dependency on external observability platforms increases infrastructure complexity
  • No historical performance data for trend analysis

By implementing an internal observability system with database-backed storage and Admin UI visualizations, operators gain complete performance visibility while maintaining infrastructure simplicity.


📖 User Stories

US-1: SRE - Monitor Tool Performance Metrics

As an SRE
I want to view performance metrics for all MCP tools
So that I can identify slow tools and optimize critical paths

Acceptance Criteria:

Given observability is enabled (OBSERVABILITY_ENABLED=true)
And tools have been invoked multiple times

When I navigate to /admin/observability/tools
Then I should see a dashboard with:
  - Summary cards showing:
    - Overall health (success rate %)
    - Most used tool (by invocation count)
    - Slowest tool (by p99 latency)
    - Most error-prone tool (by error rate %)
  
  - Performance charts:
    - Tool usage bar chart (invocation counts)
    - Average latency bar chart (milliseconds)
    - Error rate chart (percentage with color coding)
    - Top N error-prone tools chart
  
  - Detailed metrics table with columns:
    - Tool Name
    - Invocation Count
    - Latency Percentiles (p50, p90, p95, p99)
    - Error Rate (%)
    - Last Used timestamp

When I filter by "Last 1 hour"
Then only metrics from the past hour should display

When I select "Top 50" limit
Then the table should show at most 50 tools

When I enable auto-refresh
Then the dashboard should update every 60 seconds

Technical Requirements:

  • Query traces table with aggregations (COUNT, AVG, percentile calculations)
  • Color-coded health indicators: green (<5% errors), yellow (5-20%), red (>20%)
  • Chart.js for interactive visualizations
  • HTMX for auto-refresh without page reload
  • SQL performance indexes for fast queries
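The color-coded health thresholds above (also listed under Metrics Calculation in the notes) reduce to a tiny helper. A minimal sketch; the function name and string return values are illustrative, not part of the spec:

```python
def health_color(error_rate_pct: float) -> str:
    """Map an error-rate percentage to a dashboard health color.

    Thresholds follow this epic: green (<5%), yellow (5-20%), red (>20%).
    """
    if error_rate_pct < 5.0:
        return "green"
    if error_rate_pct <= 20.0:
        return "yellow"
    return "red"
```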
US-2: Developer - View Detailed Request Traces

As a Developer
I want to view detailed traces for individual HTTP requests
So that I can debug performance issues and understand request flow

Acceptance Criteria:

Given HTTP request tracing is enabled (OBSERVABILITY_TRACE_HTTP_REQUESTS=true)
And I made a POST request to /mcp/tools/invoke

When I navigate to /admin/observability/metrics
And I click on a trace in the "Recent Traces" list

Then I should see a trace detail page showing:
  - Trace metadata:
    - Trace ID
    - Start time and end time
    - Total duration (milliseconds)
    - Status (success/error)
  
  - Gantt chart timeline:
    - Visual timeline of all spans
    - Span hierarchy (parent-child relationships)
    - Duration bars with relative positioning
    - Critical path highlighting
  
  - Flame graph:
    - Hierarchical visualization of nested operations
    - Color-coded by operation type
    - Interactive zoom and search
    - Time-proportional rectangles
  
  - Span details table:
    - Span name and operation
    - Start time and duration
    - Status and attributes
    - Events (if any)

When I click on a span in the Gantt chart
Then the span details should scroll into view and highlight

When I search for "database" in the flame graph
Then matching spans should be highlighted

Technical Requirements:

  • Store traces with spans, events, and attributes in SQLite/PostgreSQL
  • Generate trace IDs and span IDs (UUIDs)
  • Implement Gantt chart with D3.js or vanilla JS
  • Implement flame graph with hierarchical rendering
  • Support span parent-child relationships
  • Calculate critical path (longest dependency chain)
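The critical-path requirement (longest dependency chain) can be sketched as a walk over the span tree. A simplified illustration, assuming each span dict carries `span_id`, `parent_span_id`, and `duration_ms`; a production version would also account for start/end times and overlapping children:

```python
from collections import defaultdict

def critical_path(spans: list[dict]) -> list[str]:
    """Return span_ids on the root-to-leaf chain with the largest summed duration."""
    children = defaultdict(list)
    by_id = {}
    roots = []
    for s in spans:
        by_id[s["span_id"]] = s
        if s["parent_span_id"] is None:
            roots.append(s["span_id"])
        else:
            children[s["parent_span_id"]].append(s["span_id"])

    def longest(sid: str) -> tuple[float, list[str]]:
        # Recurse into children and keep the costliest sub-chain.
        best_cost, best_path = 0.0, []
        for child in children[sid]:
            cost, path = longest(child)
            if cost > best_cost:
                best_cost, best_path = cost, path
        return by_id[sid]["duration_ms"] + best_cost, [sid] + best_path

    return max((longest(r) for r in roots), default=(0.0, []))[1]
```

The Gantt chart would then mark every span on the returned list with the red critical-path border.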
US-3: Platform Admin - Configure Observability Settings

As a Platform Administrator
I want to configure observability retention, sampling, and exclusions
So that I can control storage costs and trace noise

Acceptance Criteria:

Given I have access to .env configuration

When I set:
  OBSERVABILITY_ENABLED=true
  OBSERVABILITY_TRACE_RETENTION_DAYS=7
  OBSERVABILITY_MAX_TRACES=100000
  OBSERVABILITY_SAMPLE_RATE=0.5
  OBSERVABILITY_EXCLUDE_PATHS=/health,/metrics,/static/.*

Then the observability system should:
  - Automatically delete traces older than 7 days
  - Limit total traces to 100,000 (FIFO deletion when exceeded)
  - Sample 50% of requests (randomly trace half of all requests)
  - Exclude /health, /metrics, and /static/* paths from tracing

When I restart MCP Gateway
Then the new configuration should take effect

When I check the database
Then I should see:
  - No traces older than 7 days
  - Total trace count ≤ 100,000
  - No traces for /health, /metrics, or /static/* paths

Technical Requirements:

  • Automatic cleanup job for trace retention
  • FIFO trace deletion when max limit exceeded
  • Probabilistic sampling based on sample rate
  • Regex-based path exclusion matching
  • Validation: retention 1-365 days, sample rate 0.0-1.0
  • Environment variable parsing and defaults
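Parsing and validating these settings might look roughly like the following. The environment variable names match the epic; the class, its attribute names, and the `is_excluded` helper are illustrative:

```python
import os
import re

class ObservabilityConfig:
    """Parse and validate observability settings from environment variables."""

    def __init__(self, env=os.environ):
        self.enabled = env.get("OBSERVABILITY_ENABLED", "false").lower() == "true"
        self.retention_days = int(env.get("OBSERVABILITY_TRACE_RETENTION_DAYS", "7"))
        self.max_traces = int(env.get("OBSERVABILITY_MAX_TRACES", "100000"))
        self.sample_rate = float(env.get("OBSERVABILITY_SAMPLE_RATE", "1.0"))
        # Comma-separated regex patterns, pre-compiled for the middleware.
        raw = env.get("OBSERVABILITY_EXCLUDE_PATHS", "/health,/metrics,/static/.*")
        self.exclude_paths = [re.compile(p) for p in raw.split(",") if p]

        if not 1 <= self.retention_days <= 365:
            raise ValueError("OBSERVABILITY_TRACE_RETENTION_DAYS must be 1-365")
        if not 0.0 <= self.sample_rate <= 1.0:
            raise ValueError("OBSERVABILITY_SAMPLE_RATE must be 0.0-1.0")

    def is_excluded(self, path: str) -> bool:
        """True if the request path matches any exclusion pattern exactly."""
        return any(p.fullmatch(path) for p in self.exclude_paths)
```

`fullmatch` keeps `/health` from also swallowing `/healthz`, which is why the default list spells both out.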
US-4: SRE - Monitor Prompt Rendering Performance

As an SRE
I want to view performance metrics for MCP prompt rendering
So that I can identify slow prompts and optimize template rendering

Acceptance Criteria:

Given prompts have been rendered multiple times

When I navigate to /admin/observability/prompts
Then I should see:
  - Summary cards:
    - Overall rendering health (success rate)
    - Most used prompt
    - Slowest prompt (p99 latency)
    - Most error-prone prompt
  
  - Charts:
    - Prompt render frequency (usage distribution)
    - Average latency per prompt
    - Error rate percentage
    - Top N error-prone prompts
  
  - Metrics table:
    - Prompt Name
    - Render Count
    - Latency Percentiles (p50, p90, p95, p99)
    - Error Rate (%)
    - Last Rendered timestamp

When I filter by "Last 24 hours"
Then only recent prompt metrics should display

When a prompt has >20% error rate
Then it should be highlighted in red

Technical Requirements:

  • Track prompt_get_prompt operations via instrumentation
  • Store prompt name, duration, status in traces/spans
  • Aggregate metrics with percentile calculations
  • Color-coded health indicators
  • Auto-refresh support
US-5: Developer - Monitor Resource Fetch Performance

As a Developer
I want to view performance metrics for MCP resource fetching
So that I can optimize resource loading and identify slow resources

Acceptance Criteria:

Given resources have been fetched multiple times

When I navigate to /admin/observability/resources
Then I should see:
  - Summary cards:
    - Overall fetch health
    - Most used resource (by fetch count)
    - Slowest resource (p99 latency)
    - Most error-prone resource
  
  - Charts:
    - Resource fetch frequency
    - Average fetch latency
    - Error rate distribution
    - Top N error-prone resources
  
  - Metrics table:
    - Resource URI
    - Fetch Count
    - Latency Percentiles (p50, p90, p95, p99)
    - Error Rate (%)
    - Last Fetched timestamp

When I click on a resource URI
Then I should see detailed traces for that resource

When I filter by time range "Last 7 days"
Then metrics should aggregate data from the past week

Technical Requirements:

  • Track resource_read operations via instrumentation
  • Store resource URI, duration, status in spans
  • Support filtering and aggregation
  • Link to detailed trace views
US-6: Operations Engineer - Analyze Error Trends

As an Operations Engineer
I want to identify error-prone operations and view error trends
So that I can proactively fix reliability issues

Acceptance Criteria:

Given multiple operations have failed

When I navigate to /admin/observability/metrics
Then I should see:
  - Overall error rate percentage
  - List of top N error-prone operations:
    - Operation name
    - Error count
    - Error rate percentage
    - Example error message
  
  - Recent failed traces:
    - Trace ID (clickable)
    - Operation name
    - Timestamp
    - Error message
    - Duration

When I click on a failed trace
Then I should see:
  - Full trace details with all spans
  - Error status highlighted
  - Error message and stack trace (if available)
  - Span attributes showing error context

When an operation has >50% error rate
Then it should be highlighted with a critical alert color

Technical Requirements:

  • Query traces with status = "error"
  • Aggregate error counts per operation
  • Calculate error rate percentage
  • Display error messages and attributes
  • Highlight critical error rates (>50%)
US-7: Developer - Visualize Request Flow with Flame Graphs

As a Developer
I want to visualize request execution flow using flame graphs
So that I can identify nested operation bottlenecks

Acceptance Criteria:

Given a trace has nested spans (parent-child hierarchy)

When I navigate to a trace detail page
And I view the flame graph section

Then I should see:
  - Hierarchical flame graph with:
    - Root span at the bottom
    - Child spans stacked above parents
    - Width proportional to duration
    - Color-coded by operation type
  
  - Interactive features:
    - Hover to see span details (name, duration)
    - Click to zoom into span
    - Search box to find spans by name
    - Reset zoom button
  
  - Legend showing:
    - Color mapping for operation types
    - Total trace duration
    - Number of spans

When I search for "database"
Then all spans containing "database" should be highlighted

When I click on a span
Then the view should zoom to show that span and its children

When I click "Reset"
Then the view should zoom out to show the full trace

Technical Requirements:

  • Render flame graph using SVG or Canvas
  • Calculate hierarchical layout (parent-child positioning)
  • Proportional width based on span duration
  • Interactive zoom and pan
  • Search with highlighting
  • Color palette for operation types
US-8: SRE - View Request Timeline with Gantt Charts

As an SRE
I want to view request execution timeline using Gantt charts
So that I can understand operation sequencing and parallelism

Acceptance Criteria:

Given a trace has multiple spans with start/end times

When I view the trace detail page
And I view the Gantt chart section

Then I should see:
  - Timeline with:
    - Time scale (x-axis) from trace start to end
    - Span rows (y-axis) showing each operation
    - Duration bars positioned by start time
    - Color-coded by span status (success/error)
  
  - Critical path highlighting:
    - Longest dependency chain highlighted
    - Critical spans marked with red border
  
  - Interactive features:
    - Hover to see span details
    - Click to view span attributes
    - Zoom and pan timeline
    - Expand/collapse nested spans

When I hover over a span bar
Then I should see a tooltip with:
  - Span name
  - Start time (relative to trace start)
  - Duration (milliseconds)
  - Status

When I click on a span bar
Then the span details table should scroll to that span

Technical Requirements:

  • Render Gantt chart using SVG or Canvas
  • Calculate relative positioning (start time offset)
  • Proportional bar width based on duration
  • Critical path calculation (longest chain)
  • Interactive tooltip and click handlers
  • Expand/collapse for nested spans
US-9: Platform Admin - Auto-Cleanup Old Traces

As a Platform Administrator
I want automatic cleanup of old traces based on retention policy
So that database storage doesn't grow unbounded

Acceptance Criteria:

Given OBSERVABILITY_TRACE_RETENTION_DAYS=7
And OBSERVABILITY_MAX_TRACES=100000

When the cleanup job runs (hourly or daily)
Then it should:
  - Delete traces older than 7 days
  - Delete oldest traces if total count > 100,000
  - Log cleanup actions (traces deleted, retention applied)

When I query the traces table
Then I should see:
  - No traces with created_at > 7 days ago
  - Total trace count ≤ 100,000

When the database has 150,000 traces
Then the cleanup job should:
  - Delete the oldest 50,000 traces (FIFO)
  - Keep the newest 100,000 traces

Technical Requirements:

  • Scheduled cleanup job (APScheduler or cron)
  • DELETE query with timestamp filter
  • FIFO deletion (ORDER BY created_at ASC LIMIT N)
  • Logging with trace count and retention stats
  • Configurable cleanup interval
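The two retention rules boil down to two DELETE statements. A sketch against SQLite; table and column names follow the schema in the Architecture section below, while the function itself is illustrative:

```python
import sqlite3
import time

def cleanup_traces(conn: sqlite3.Connection, retention_days: int, max_traces: int) -> int:
    """Apply retention policy: drop expired traces, then FIFO-trim to max_traces.

    Returns the number of traces deleted. Assumes start_time is a Unix
    epoch (REAL), as in this epic's schema sketch.
    """
    cutoff = time.time() - retention_days * 86400
    cur = conn.execute(
        "DELETE FROM observability_traces WHERE start_time < ?", (cutoff,)
    )
    deleted = cur.rowcount
    # FIFO trim: keep only the newest max_traces rows.
    cur = conn.execute(
        """DELETE FROM observability_traces WHERE id NOT IN (
               SELECT id FROM observability_traces
               ORDER BY created_at DESC, id DESC LIMIT ?)""",
        (max_traces,),
    )
    deleted += cur.rowcount
    conn.commit()
    return deleted
```

The returned count feeds the "log cleanup actions" requirement; in production the spans/attributes/events rows would be deleted alongside (or via ON DELETE CASCADE).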
US-10: Developer - SQLAlchemy Query Instrumentation

As a Developer
I want SQLAlchemy queries to be automatically traced
So that I can see database operation performance in traces

Acceptance Criteria:

Given SQLAlchemy instrumentation is enabled

When I invoke a tool that queries the database
Then the trace should include spans for:
  - SQL query execution
  - Query duration (milliseconds)
  - Query text (sanitized)
  - Table name
  - Row count (if applicable)

When I view the trace detail
Then I should see:
  - Span name: "db.query" or "db.execute"
  - Attributes:
    - db.statement: SQL query text
    - db.system: "sqlite" or "postgresql"
    - db.operation: "SELECT" or "INSERT" or "UPDATE"
  - Duration showing query execution time

When a query takes >1000ms
Then the span should be highlighted as slow

Technical Requirements:

  • SQLAlchemy event listeners (before_cursor_execute, after_cursor_execute)
  • Create spans for queries
  • Capture query text, parameters, duration
  • Sanitize sensitive data (passwords, tokens)
  • Set span attributes with db.* namespace
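Sanitizing sensitive data before storing `db.statement` might be a small redaction pass over the query text. The column-name pattern below is illustrative and deliberately conservative; a real deployment would tune the list and also scrub bound parameters:

```python
import re

# Redact values assigned to sensitive-looking columns. Pattern is a sketch,
# not exhaustive: matches quoted literals or bare tokens after "col =".
_SENSITIVE_COLS = re.compile(
    r"(\b(?:password|token|secret|api_key)\b\s*=\s*)('[^']*'|\S+)",
    re.IGNORECASE,
)

def sanitize_statement(sql: str) -> str:
    """Return SQL text safe to store as the db.statement span attribute."""
    return _SENSITIVE_COLS.sub(r"\1'[REDACTED]'", sql)
```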

🏗 Architecture

Component Structure

Observability System
├── Middleware
│   ├── ObservabilityMiddleware (HTTP request tracing)
│   └── AuthMiddleware (user context propagation)
│
├── Instrumentation
│   ├── SQLAlchemy (database query tracing)
│   └── Custom decorators (tool/prompt/resource tracing)
│
├── Storage Layer
│   ├── Traces table (trace metadata)
│   ├── Spans table (operation details)
│   ├── Span Events table (log events)
│   └── Span Attributes table (key-value pairs)
│
├── Service Layer
│   ├── ObservabilityService (trace CRUD, metrics aggregation)
│   └── Cleanup service (retention enforcement)
│
├── Router Layer
│   └── ObservabilityRouter (/admin/observability/*)
│
└── UI Layer
    ├── Metrics dashboard (summary cards, charts)
    ├── Tools dashboard (tool performance)
    ├── Prompts dashboard (prompt performance)
    ├── Resources dashboard (resource performance)
    ├── Trace list (recent traces)
    ├── Trace detail (Gantt chart, flame graph, spans)
    └── Interactive visualizations (Chart.js, custom SVG)

Database Schema

-- Indexes are created separately so the DDL works on both SQLite and
-- PostgreSQL (inline INDEX clauses are MySQL syntax), and index names
-- are prefixed per table because they share one namespace.

CREATE TABLE observability_traces (
    id INTEGER PRIMARY KEY,
    trace_id TEXT UNIQUE NOT NULL,  -- UNIQUE already provides an index
    name TEXT NOT NULL,
    start_time REAL NOT NULL,
    end_time REAL,
    duration_ms REAL,
    status TEXT,  -- success|error
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_traces_start_time ON observability_traces (start_time);
CREATE INDEX idx_traces_status ON observability_traces (status);

CREATE TABLE observability_spans (
    id INTEGER PRIMARY KEY,
    trace_id TEXT NOT NULL,
    span_id TEXT UNIQUE NOT NULL,  -- UNIQUE already provides an index
    parent_span_id TEXT,
    name TEXT NOT NULL,
    start_time REAL NOT NULL,
    end_time REAL,
    duration_ms REAL,
    status TEXT,
    operation TEXT,  -- tool_invoke|prompt_render|resource_fetch|http_request
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (trace_id) REFERENCES observability_traces (trace_id)
);
CREATE INDEX idx_spans_trace_id ON observability_spans (trace_id);
CREATE INDEX idx_spans_operation ON observability_spans (operation);

CREATE TABLE observability_span_attributes (
    id INTEGER PRIMARY KEY,
    span_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT,
    FOREIGN KEY (span_id) REFERENCES observability_spans (span_id)
);
CREATE INDEX idx_attrs_span_id ON observability_span_attributes (span_id);

CREATE TABLE observability_span_events (
    id INTEGER PRIMARY KEY,
    span_id TEXT NOT NULL,
    timestamp REAL NOT NULL,
    name TEXT NOT NULL,
    attributes TEXT,  -- JSON
    FOREIGN KEY (span_id) REFERENCES observability_spans (span_id)
);
CREATE INDEX idx_events_span_id ON observability_span_events (span_id);

Trace Flow

  1. HTTP Request → ObservabilityMiddleware creates trace + root span
  2. Tool Invocation → ToolService creates child span (operation: tool_invoke)
  3. Database Query → SQLAlchemy instrumentation creates child span (operation: db.query)
  4. Response → Middleware ends root span, calculates duration, stores trace

Metrics Aggregation

-- Tool metrics (p50/p90/p95/p99 latencies). PERCENTILE_CONT is
-- PostgreSQL-only; on SQLite, compute percentiles in the application layer.
SELECT 
    name,
    COUNT(*) as invocation_count,
    AVG(duration_ms) as avg_latency,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY duration_ms) as p90,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99,
    SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as error_rate
FROM observability_spans
WHERE operation = 'tool_invoke'
    AND start_time > <time_filter>
GROUP BY name
ORDER BY invocation_count DESC;
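For SQLite, which has no PERCENTILE_CONT, the percentiles can be computed in Python over the fetched durations. A linear-interpolation sketch matching PERCENTILE_CONT semantics:

```python
def percentile(durations: list[float], p: float) -> float:
    """PERCENTILE_CONT-style percentile (linear interpolation), p in [0, 1].

    Application-layer fallback for SQLite, which lacks a built-in
    percentile aggregate.
    """
    if not durations:
        raise ValueError("percentile of empty input")
    xs = sorted(durations)
    k = (len(xs) - 1) * p          # fractional rank
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)
```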

📋 Implementation Tasks

  • Database Schema & Migrations

    • Create observability_traces table
    • Create observability_spans table
    • Create observability_span_attributes table
    • Create observability_span_events table
    • Add performance indexes (trace_id, span_id, operation, start_time)
    • Add saved query views for common metrics
    • Alembic migration: a23a08d61eb0_add_observability_tables.py
    • Alembic migration: i3c4d5e6f7g8_add_observability_performance_indexes.py
    • Alembic migration: j4d5e6f7g8h9_add_observability_saved_queries.py
  • Configuration & Settings

    • Add OBSERVABILITY_ENABLED setting (default: false)
    • Add OBSERVABILITY_TRACE_HTTP_REQUESTS (default: true)
    • Add OBSERVABILITY_TRACE_RETENTION_DAYS (default: 7)
    • Add OBSERVABILITY_MAX_TRACES (default: 100000)
    • Add OBSERVABILITY_SAMPLE_RATE (default: 1.0)
    • Add OBSERVABILITY_EXCLUDE_PATHS (default: /health,/metrics,/static/.*)
    • Add OBSERVABILITY_METRICS_ENABLED (default: true)
    • Add OBSERVABILITY_EVENTS_ENABLED (default: true)
    • Update .env.example with all observability settings
    • Update config.py with validation and defaults
    • Update Helm chart values.yaml with observability config
  • Middleware Implementation

    • Create ObservabilityMiddleware for HTTP request tracing
    • Implement trace creation (trace_id, root span)
    • Implement span creation with start_time
    • Implement span ending with end_time, duration calculation
    • Implement sampling logic based on OBSERVABILITY_SAMPLE_RATE
    • Implement path exclusion regex matching
    • Store request method, path, status_code as span attributes
    • Handle exceptions and mark spans as error status
    • Integrate with FastAPI middleware stack
  • SQLAlchemy Instrumentation

    • Create SQLAlchemy event listeners (before/after execute)
    • Create spans for database queries
    • Capture query text, operation (SELECT/INSERT/UPDATE)
    • Capture query duration
    • Set span attributes (db.statement, db.system, db.operation)
    • Sanitize sensitive data in queries
    • Handle query errors and mark spans as failed
  • Service Layer - ObservabilityService

    • Implement create_trace(name, start_time) → trace_id
    • Implement create_span(trace_id, name, operation, parent_span_id)
    • Implement end_span(span_id, end_time, status, attributes)
    • Implement add_span_event(span_id, name, timestamp, attributes)
    • Implement get_traces(limit, offset, time_filter)
    • Implement get_trace_by_id(trace_id) with spans
    • Implement get_tool_metrics(time_filter, limit)
    • Implement get_prompt_metrics(time_filter, limit)
    • Implement get_resource_metrics(time_filter, limit)
    • Implement cleanup_old_traces(retention_days, max_traces)
    • Implement percentile calculations (p50, p90, p95, p99)
  • Router Layer - ObservabilityRouter

    • Create /admin/observability route
    • Create /admin/observability/metrics route (summary dashboard)
    • Create /admin/observability/tools route (tool metrics)
    • Create /admin/observability/prompts route (prompt metrics)
    • Create /admin/observability/resources route (resource metrics)
    • Create /admin/observability/traces route (trace list)
    • Create /admin/observability/traces/{trace_id} route (trace detail)
    • Implement time filter query params (1h, 24h, 7d, 30d)
    • Implement limit query param (10, 20, 50, 100)
    • Implement auto-refresh query param
  • Templates - Dashboard UI

    • Create observability_metrics.html (summary dashboard)
    • Create observability_tools.html (tool metrics dashboard)
    • Create observability_prompts.html (prompt metrics dashboard)
    • Create observability_resources.html (resource metrics dashboard)
    • Create observability_traces_list.html (recent traces)
    • Create observability_trace_detail.html (trace details)
    • Create observability_partial.html (shared components)
    • Create observability_stats.html (summary cards)
    • Add summary cards (health, most used, slowest, error-prone)
    • Add time filter dropdown (1h, 24h, 7d, 30d)
    • Add limit dropdown (10, 20, 50, 100)
    • Add auto-refresh toggle (60s interval)
    • Add HTMX for dynamic updates
  • Charts & Visualizations

    • Implement Chart.js bar charts (usage, latency, error rate)
    • Implement color-coded health indicators
    • Implement Gantt chart for trace timeline
    • Implement flame graph for trace hierarchy
    • Create flame-graph.js with interactive features
    • Create flame-graph.css for styling
    • Create gantt-chart.js with timeline rendering
    • Create gantt-chart.css for styling
    • Implement critical path highlighting
    • Implement zoom and pan for flame graphs
    • Implement search in flame graphs
    • Implement tooltips for span details
  • Tool/Prompt/Resource Instrumentation

    • Update ToolService.invoke_tool() to create spans
    • Update PromptService.get_prompt() to create spans
    • Update ResourceService.read_resource() to create spans
    • Set operation attribute (tool_invoke, prompt_render, resource_fetch)
    • Capture operation name, duration, status
    • Handle errors and mark spans as failed
  • Cleanup & Retention

    • Implement scheduled cleanup job (APScheduler)
    • Delete traces older than retention days
    • Delete oldest traces when max limit exceeded (FIFO)
    • Log cleanup actions (traces deleted, retention stats)
    • Run cleanup hourly or daily
  • Testing

    • Unit tests: ObservabilityService CRUD operations (10+ tests)
    • Unit tests: Trace creation, span creation, span ending (5+ tests)
    • Unit tests: Metrics aggregation (percentiles, error rates) (5+ tests)
    • Unit tests: Cleanup logic (retention, max traces) (3+ tests)
    • Integration tests: Middleware trace creation (5+ tests)
    • Integration tests: SQLAlchemy instrumentation (5+ tests)
    • Integration tests: Tool/prompt/resource tracing (10+ tests)
    • Integration tests: Router endpoints (10+ tests)
    • Playwright tests: Dashboard navigation (5+ tests)
    • Playwright tests: Time filter interaction (3+ tests)
    • Playwright tests: Trace detail view (3+ tests)
    • Playwright tests: Flame graph interaction (3+ tests)
    • Performance tests: Query 100K traces (<500ms)
    • Performance tests: Aggregate metrics with 1M spans (<1s)
    • Test coverage: 85%+ for observability code
  • Documentation

    • Create docs/docs/manage/observability/internal-observability.md
    • Document configuration options (.env variables)
    • Document Admin UI dashboards (tools, prompts, resources)
    • Document trace detail view (Gantt chart, flame graph)
    • Document metrics (p50, p90, p95, p99 latencies)
    • Document retention and cleanup policies
    • Document sampling configuration
    • Document path exclusion patterns
    • Add examples: enabling observability
    • Add examples: viewing tool metrics
    • Add examples: analyzing traces
    • Add screenshots of dashboards
    • Update README.md with observability features
    • Update CLAUDE.md with observability usage
  • Code Quality

    • Run make autoflake isort black pre-commit
    • Run make flake8 bandit interrogate pylint verify
    • Run make lint-web for CSS/JS validation
    • Ensure consistent naming conventions
    • Add docstrings to all public functions
    • Remove debug console.log statements
    • Validate HTML structure
    • Validate CSS (no unused styles)

⚙️ Configuration

Environment Variables

# Enable internal observability
OBSERVABILITY_ENABLED=false

# Automatically trace HTTP requests
OBSERVABILITY_TRACE_HTTP_REQUESTS=true

# Number of days to retain trace data
OBSERVABILITY_TRACE_RETENTION_DAYS=7

# Maximum number of traces to retain (prevents unbounded growth)
OBSERVABILITY_MAX_TRACES=100000

# Trace sampling rate (0.0-1.0) - 1.0 means trace everything, 0.1 means trace 10%
OBSERVABILITY_SAMPLE_RATE=1.0

# Paths to exclude from tracing (comma-separated regex patterns)
OBSERVABILITY_EXCLUDE_PATHS=/health,/healthz,/ready,/metrics,/static/.*

# Enable metrics collection
OBSERVABILITY_METRICS_ENABLED=true

# Enable event logging within spans
OBSERVABILITY_EVENTS_ENABLED=true

✅ Success Criteria

  • Trace Storage: Traces and spans stored in SQLite/PostgreSQL with proper indexes
  • HTTP Tracing: All HTTP requests automatically traced (excluding configured paths)
  • Metrics Dashboards: Tools, prompts, and resources have dedicated metrics dashboards
  • Performance Metrics: p50, p90, p95, p99 latency percentiles calculated and displayed
  • Error Tracking: Error rates and error-prone operations identified
  • Trace Visualization: Gantt charts and flame graphs for trace exploration
  • Interactive UI: Click-through from metrics to detailed traces
  • Auto-Refresh: Dashboards update every 60 seconds
  • Retention Policy: Automatic cleanup of old traces based on configuration
  • Sampling: Configurable trace sampling to reduce storage costs
  • Path Exclusion: Health checks and static assets excluded from tracing
  • Testing: 85%+ test coverage for observability code
  • Documentation: Comprehensive guide with examples and screenshots
  • Performance: Query 100K traces in <500ms, aggregate metrics in <1s

🏁 Definition of Done

  • Database tables created with Alembic migrations
  • Performance indexes added for fast queries
  • ObservabilityMiddleware implemented and integrated
  • SQLAlchemy instrumentation implemented
  • ObservabilityService with CRUD and metrics methods
  • ObservabilityRouter with all dashboard routes
  • Templates for metrics, tools, prompts, resources, traces
  • Gantt chart visualization implemented (gantt-chart.js)
  • Flame graph visualization implemented (flame-graph.js)
  • Summary cards with health indicators
  • Chart.js visualizations (usage, latency, error rate)
  • Time filter and limit controls functional
  • Auto-refresh toggle functional (60s interval)
  • Cleanup job for trace retention
  • Configuration options in .env.example
  • Config.py with validation and defaults
  • Unit tests for service layer (10+ tests)
  • Integration tests for middleware and instrumentation (15+ tests)
  • Playwright tests for UI dashboards (10+ tests)
  • Performance tests for queries and aggregations (3+ tests)
  • Documentation: internal-observability.md
  • Documentation: configuration.md updated
  • README.md updated with observability features
  • CLAUDE.md updated with usage instructions
  • Code passes make lint-web checks
  • Code passes make flake8 bandit interrogate pylint verify
  • Screenshots captured for documentation
  • Backward compatible: No breaking changes to existing code

📝 Additional Notes

🔹 Differences from OpenTelemetry Integration:

  • Storage: Local database (SQLite/PostgreSQL) vs. external systems (Phoenix, Jaeger)
  • Infrastructure: Self-contained vs. requires external observability platform
  • Complexity: Simpler setup vs. more configuration
  • Use Case: Development/testing vs. production at scale

🔹 Performance Considerations:

  • Indexes on trace_id, span_id, operation, start_time for fast queries
  • Sampling to reduce trace volume (10-50% in production)
  • Automatic cleanup to prevent unbounded storage growth
  • Efficient percentile calculations using database aggregations
  • Client-side Chart.js rendering to reduce server load

🔹 Trace Visualization Technologies:

  • Gantt Chart: Custom JavaScript with SVG rendering
  • Flame Graph: Hierarchical SVG visualization with zoom/pan
  • Charts: Chart.js for bar charts and line charts
  • HTMX: Dynamic updates without full page reload

🔹 Metrics Calculation:

  • Latency Percentiles: SQL PERCENTILE_CONT function (PostgreSQL) or custom calculation (SQLite)
  • Error Rate: (COUNT(status='error') / COUNT(*)) * 100
  • Health Status: Green (<5%), Yellow (5-20%), Red (>20% errors)

🔹 Future Enhancements:

  • Distributed Tracing: Trace propagation across multiple MCP Gateway instances
  • Alerting: Threshold-based alerts (error rate >10%, latency >1s)
  • Anomaly Detection: Machine learning for unusual performance patterns
  • Trace Comparison: Side-by-side comparison of two traces
  • Export to OpenTelemetry: Bridge to external systems when needed
  • Live Tail: Real-time trace streaming in Admin UI
  • Custom Dashboards: User-configurable dashboard widgets
  • SLO Tracking: Service Level Objective monitoring (p99 <500ms)

🔹 Security Considerations:

  • Sanitize SQL queries in traces (remove passwords, tokens)
  • Sanitize HTTP headers (remove Authorization, Cookie)
  • Role-based access control for observability endpoints
  • Trace retention limits to prevent storage exhaustion

🔹 Migration Notes:

  • Run Alembic migrations to create observability tables
  • No data migration required (new feature)
  • Observability disabled by default (opt-in via OBSERVABILITY_ENABLED=true)
  • No breaking changes to existing APIs
  • Backward compatible with all existing features

🔗 Related Issues

  • OpenTelemetry integration (external observability platforms)
  • Performance optimization and profiling
  • Admin UI enhancements
  • Database query optimization
  • SLO/SLA monitoring

Metadata

Labels
  • epic: Large feature spanning multiple issues
  • observability: Observability, logging, monitoring