Add OpenTelemetry tracing for STDIO, MCP methods, and tool execution #1633

@ShubhamPalriwala

What are you trying to do that currently feels hard or impossible?

Critical gaps in OpenTelemetry instrumentation prevent production observability:

1. Debug STDIO client connections (Claude Desktop, MCP Inspector)

  • STDIO transport has 0% instrumentation coverage
  • Cannot trace connection lifecycle or message flow
  • Cannot measure performance or see errors
  • Primary desktop client transport is a complete blind spot

2. Measure MCP protocol performance

  • Cannot distinguish between MCP methods (initialize, tools/list, tools/call, ping)
  • MCP tool invocations record no metrics, whereas the Native API does (api.go:154)
  • Majority of production usage is invisible in monitoring dashboards

3. Analyze tool execution performance

  • No per-tool tracing for any of the third-party tools
  • Cannot measure individual tool execution time
  • Cannot identify slow tools or other performance bottlenecks
  • Cannot trace external API calls from tools

4. Track authentication failures

  • No tracing for token validation, claims extraction, or authorization checks
  • Cannot measure auth overhead or debug auth-related failures

Current Coverage:

| Transport | Traces | Metrics | Coverage |
|-----------|--------|---------|----------|
| Native | | | 100% |
| HTTP/SSE | ⚠️ Partial | ⚠️ Partial | ~60% |
| STDIO | ❌ None | ❌ None | 0% |

Impact: Cannot effectively monitor, debug, or optimize production deployments. Server-side metrics are essential for SLOs/SLIs but currently missing for MCP flows.

Suggested Solution(s)

Implement comprehensive OpenTelemetry instrumentation across Toolbox:

STDIO Transport Instrumentation

  • Add connection lifecycle spans in ServeStdio()
  • Add message read/write tracing in readInputStream()
  • Add STDIO-specific metrics (connections, messages, size)
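
A minimal sketch of where the STDIO metrics could hook into a read loop, assuming newline-delimited messages. It is not the actual readInputStream() implementation: stdioStats, readMessages, and readString are hypothetical names, and plain integer counters stand in for OpenTelemetry instruments so the example runs with the standard library alone.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// stdioStats is a hypothetical stand-in for the OTel instruments the issue
// proposes (connections, messages, size).
type stdioStats struct {
	Connections int
	Messages    int
	Bytes       int // payload bytes, excluding the trailing newline
}

// readMessages consumes newline-delimited messages from r and updates stats
// for each one before passing it to handle. In a real integration the whole
// call would sit inside a connection-lifecycle span and each loop iteration
// inside a per-message span.
// Note: bufio.Scanner rejects tokens above 64 KiB unless sc.Buffer is called.
func readMessages(r io.Reader, stats *stdioStats, handle func(msg []byte) error) error {
	stats.Connections++
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		msg := sc.Bytes()
		stats.Messages++
		stats.Bytes += len(msg)
		if err := handle(msg); err != nil {
			return fmt.Errorf("handle message %d: %w", stats.Messages, err)
		}
	}
	return sc.Err()
}

// readString is a convenience wrapper that feeds a string through
// readMessages with a no-op handler.
func readString(input string) (stdioStats, error) {
	var stats stdioStats
	err := readMessages(strings.NewReader(input), &stats, func([]byte) error { return nil })
	return stats, err
}

func main() {
	stats, err := readString("{\"method\":\"ping\"}\n{\"method\":\"tools/list\"}\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", stats)
}
```

Counting at the read loop rather than inside individual handlers means malformed or unrecognized messages are still measured.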

MCP Method-Level Tracing

  • Pass instrumentation object to MCP method handlers
  • Add spans for each method: initialize, tools/list, tools/call, ping
  • Critical fix: Add instrumentation.ToolInvoke.Add() in toolsCallHandler() (currently missing)
  • Apply consistently to all 3 MCP protocol versions (v20241105, v20250326, v20250618)
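
The idea of passing the instrumentation object into the method handlers could look roughly like the sketch below. Everything here is an assumption made for illustration: the instrumentation type, startSpan, processMethod, and the "mcp." span-name prefix are hypothetical, and a slice and a map stand in for the real tracer and the ToolInvoke counter. Real handlers would also take a context.Context.

```go
package main

import (
	"errors"
	"fmt"
)

// instrumentation is a hypothetical stand-in for the Toolbox telemetry
// object. spans records span names in start order; toolInvoke plays the role
// of the counter the Native API already increments.
type instrumentation struct {
	spans      []string
	toolInvoke map[string]int
}

func newInstrumentation() *instrumentation {
	return &instrumentation{toolInvoke: map[string]int{}}
}

// startSpan records the span name and returns the function that would end it.
func (i *instrumentation) startSpan(name string) (end func()) {
	i.spans = append(i.spans, name)
	return func() {}
}

// request is a simplified MCP request; Tool is only set for tools/call.
type request struct {
	Method string
	Tool   string
}

var errUnknownMethod = errors.New("unknown MCP method")

// processMethod opens one span per MCP method so the methods become
// distinguishable in traces, and counts tool invocations in tools/call.
func processMethod(inst *instrumentation, req request) (string, error) {
	end := inst.startSpan("mcp." + req.Method)
	defer end()

	switch req.Method {
	case "initialize", "ping":
		return "ok", nil
	case "tools/list":
		return "[]", nil
	case "tools/call":
		inst.toolInvoke[req.Tool]++ // the metric that is currently missing
		return "called " + req.Tool, nil
	default:
		return "", fmt.Errorf("%w: %s", errUnknownMethod, req.Method)
	}
}

func main() {
	inst := newInstrumentation()
	for _, req := range []request{
		{Method: "initialize"},
		{Method: "tools/list"},
		{Method: "tools/call", Tool: "search"},
		{Method: "ping"},
	} {
		if _, err := processMethod(inst, req); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println(inst.spans, inst.toolInvoke)
}
```

Starting the span in the shared dispatch function, before the switch, is one way to keep the three protocol versions consistent without repeating the span code in every handler.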

Authentication Tracing

  • Add spans for token validation, claims extraction, authorization checks
  • Add auth success/failure metrics

Tool Interface Changes

  • Give tools access to a tracer through the Tool interface (Go interfaces cannot declare fields, so this means a method or a parameter)
  • Instrument third-party tools
  • Performance validation
  • Enhanced error recording with structured attributes
  • Performance metrics (latency histograms, size metrics)
  • Documentation, testing utilities, and monitoring dashboards
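
One possible way to get per-tool timing and error recording without touching each third-party tool is a wrapper around the interface, sketched below. This is an alternative to changing the interface itself, not what the issue proposes verbatim. The Tool interface shown is a simplified, hypothetical one (the real Toolbox interface may differ), and observation, tracedTool, funcTool, and runDemo are illustrative names; the record callback stands in for ending a span and updating a latency histogram.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Tool is a simplified, hypothetical version of the Toolbox tool interface.
type Tool interface {
	Name() string
	Invoke(ctx context.Context, params map[string]any) (any, error)
}

// observation is what a single invocation reports. In a real integration
// these fields would become span attributes and histogram samples.
type observation struct {
	Tool     string
	Duration time.Duration
	Err      error
}

// tracedTool wraps any Tool and reports one observation per call.
type tracedTool struct {
	Tool
	record func(observation)
}

// Invoke times the wrapped call and records its outcome, success or failure.
func (t tracedTool) Invoke(ctx context.Context, params map[string]any) (any, error) {
	start := time.Now()
	out, err := t.Tool.Invoke(ctx, params)
	t.record(observation{Tool: t.Name(), Duration: time.Since(start), Err: err})
	return out, err
}

// funcTool adapts a plain function to the Tool interface for the demo.
type funcTool struct {
	name string
	fn   func(ctx context.Context, params map[string]any) (any, error)
}

func (f funcTool) Name() string { return f.name }
func (f funcTool) Invoke(ctx context.Context, params map[string]any) (any, error) {
	return f.fn(ctx, params)
}

// runDemo calls one succeeding and one failing tool through the wrapper and
// returns what was recorded.
func runDemo() []observation {
	var seen []observation
	record := func(o observation) { seen = append(seen, o) }

	echo := tracedTool{Tool: funcTool{name: "echo", fn: func(_ context.Context, p map[string]any) (any, error) {
		return p["msg"], nil
	}}, record: record}
	broken := tracedTool{Tool: funcTool{name: "broken", fn: func(context.Context, map[string]any) (any, error) {
		return nil, errors.New("upstream API unavailable")
	}}, record: record}

	ctx := context.Background()
	_, _ = echo.Invoke(ctx, map[string]any{"msg": "hi"})
	_, _ = broken.Invoke(ctx, nil)
	return seen
}

func main() {
	for _, o := range runDemo() {
		fmt.Printf("tool=%s failed=%t duration=%s\n", o.Tool, o.Err != nil, o.Duration)
	}
}
```

Because the wrapper satisfies the same interface, it could be applied where tools are registered, which keeps individual tool packages unchanged. Tracing external API calls made inside a tool would still require passing the context through to those calls.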

Success Criteria:

  • STDIO transport: 0% → 100% instrumented
  • MCP metrics achieve parity with Native API
  • All MCP methods create distinct, traceable spans
  • All third-party tools instrumented with execution tracing
  • Performance overhead < 5%

Alternatives Considered

Client-side instrumentation only

  • Why rejected: Cannot measure operations from non-instrumented clients; server-side metrics required for SLOs/SLIs

Sampling-only approach

  • Why rejected: Metrics require 100% coverage to be accurate; sampling is useful for traces, but the spans must exist before they can be sampled

Manual logging instead of OpenTelemetry

  • Why rejected: Toolbox already uses the OpenTelemetry SDK; logs alone do not provide distributed tracing with correlation, and OpenTelemetry works with industry-standard exporters

Current workarounds:

  • Manual log analysis (time-consuming, incomplete)
  • Client-side metrics only (misses server-side operations)
  • HTTP API testing to infer MCP behavior (unreliable)

Additional Details

Root Cause Analysis:

  • Native API handlers have direct access to s.instrumentation (working correctly)
  • MCP method handlers sit at the protocol layer and have no access to the instrumentation object
  • STDIO session lifecycle not traced: ServeStdio() at server.go:372 creates no span

Code Locations:

  • Telemetry: internal/telemetry/instrumentation.go
  • Native API: internal/server/api.go:154 (metrics working)
  • MCP Handlers: internal/server/mcp/v*/method.go:163 (metrics missing)
  • STDIO: internal/server/server.go:372 (no tracing)

Metadata

Labels

priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)
