feat: Added support of Open Telemetry tracing for Smart Router#2260
Conversation
Review Summary by QodoAdd OpenTelemetry tracing support for Smart Router with comprehensive instrumentation
WalkthroughsDescription• Adds comprehensive OpenTelemetry tracing support to Smart Router with full SDK initialization and lifecycle management • Implements trace context propagation using W3C standards for HTTP and gRPC protocols • Instruments Smart Router server with spans at relay entry point and throughout the processing pipeline (parse, process, cache lookup, session acquisition) • Adds client-side span instrumentation for JSON-RPC, REST, and gRPC relay operations with request/response details • Records relay attributes including GUID, chain ID, API interface, method, consistency stats, cache results, and retry counts • Provides span helper utilities for error recording, body recording with truncation, and orphan span protection • Includes comprehensive test coverage for span helpers, trace context propagation, and trace manager configuration • Adds --otel-trace-body CLI flag for enabling request/response body recording • Delegates standard OpenTelemetry configuration to environment variables per SDK specification • Updates dependencies to include OpenTelemetry core, SDK, OTLP exporters, and related packages Diagramflowchart LR
A["Smart Router Startup"] -->|Initialize| B["TraceManager"]
B -->|Configure| C["OTel SDK"]
C -->|Create| D["Tracer Provider"]
E["Relay Request"] -->|Extract Context| F["Server Span"]
F -->|Record Attributes| G["Relay Data"]
F -->|Create Child Spans| H["Internal Operations"]
H -->|Parse/Process/Cache| I["Operation Spans"]
J["Direct RPC Relay"] -->|Inject Context| K["Client Span"]
K -->|Record Details| L["Request/Response"]
L -->|Propagate| M["W3C Headers"]
N["Shutdown"] -->|Graceful| O["Trace Manager"]
File Changes1. protocol/tracing/span_helpers_test.go
|
Code Review by Qodo
|
| traceCfg := tracing.TraceConfig{ | ||
| TraceBody: viper.GetBool(tracing.TraceBodyFlag), | ||
| } | ||
| traceManager, err := tracing.New(ctx, traceCfg) |
There was a problem hiding this comment.
1. Trace body flag ignored 🐞 Bug ≡ Correctness
CreateRPCSmartRouterCobraCommand reads TraceConfig.TraceBody via viper.GetBool("otel-trace-body")
even though the flag is only registered on Cobra, so --otel-trace-body won’t enable body recording
at runtime.
Agent Prompt
### Issue description
`--otel-trace-body` is registered as a Cobra flag but `TraceBody` is read via `viper.GetBool(tracing.TraceBodyFlag)` without any `viper.BindPFlag` call (or direct `cmd.Flags().GetBool`). As a result, passing `--otel-trace-body` won’t actually enable request/response body recording.
### Issue Context
In this command, some Cobra flags are explicitly bound into viper via `viper.BindPFlag(...)` (e.g. provider optimizer weights). `otel-trace-body` needs the same treatment, or should be read directly from Cobra flags.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1354-1367]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1482-1494]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1403-1423]
### Suggested fix
After registering the flag, add:
```go
if err := viper.BindPFlag(tracing.TraceBodyFlag, cmdRPCSmartRouter.Flags().Lookup(tracing.TraceBodyFlag)); err != nil {
utils.LavaFormatFatal("failed binding otel trace body flag", err)
}
```
Alternatively, in `RunE` read it from Cobra:
```go
traceBody, _ := cmd.Flags().GetBool(tracing.TraceBodyFlag)
traceCfg := tracing.TraceConfig{ TraceBody: traceBody }
```
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Commit 13bd410 addressed this comment by binding the newly registered otel trace body Cobra flag to Viper with viper.BindPFlag, ensuring viper.GetBool(tracing.TraceBodyFlag) now reflects the --otel-trace-body setting at runtime.
NadavLevi
left a comment
There was a problem hiding this comment.
Review of the OTel tracing changes — 5 inline comments. Two are functional bugs (body flag dead, GUID forced to 0); the other three are smaller correctness/perf notes.
| // because it's a per-invocation debug toggle rather than deployment | ||
| // configuration. Body size is delegated to the SDK via | ||
| // OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT (SDK default: unlimited). | ||
| cmdRPCSmartRouter.Flags().Bool(tracing.TraceBodyFlag, false, "record request/response bodies on trace spans (size limit delegated to OTel SDK via OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT)") |
There was a problem hiding this comment.
--otel-trace-body is registered on Cobra here but read via viper.GetBool(tracing.TraceBodyFlag) at line 1477 without a corresponding viper.BindPFlag. Other flags in this file explicitly bind to viper (lines 1515, 1531–1543, 1593). As-is, viper returns the default (false) regardless of CLI input, so the flag is a no-op.
Suggested fix — add right after the Flags().Bool(...) registration:
if err := viper.BindPFlag(tracing.TraceBodyFlag, cmdRPCSmartRouter.Flags().Lookup(tracing.TraceBodyFlag)); err != nil {
log.Fatalln("unable to bind otel-trace-body flag", err)
}There was a problem hiding this comment.
Commit 13bd410 addressed this comment by binding the tracing.TraceBodyFlag Cobra flag to Viper right after it is registered, ensuring viper.GetBool(tracing.TraceBodyFlag) reflects CLI input instead of always returning the default.
| goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid) | ||
| } | ||
| guid, _ := utils.GetUniqueIdentifier(ctx) | ||
| goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid) |
There was a problem hiding this comment.
Behavior change in this PR: utils.WithUniqueIdentifier (utils/uniqueIdentifier.go:14) writes unconditionally, so when GetUniqueIdentifier returns (0, false) this line stores uint64(0) into goroutineCtx.
Downstream, utils.AppendUniqueIdentifier (utils/uniqueIdentifier.go:19) skips when ctx.Value(...) != nil — uint64(0) satisfies that — so real GUID generation later in the pipeline is blocked. Previously the write was gated by if found.
Suggested fix — restore the guard and still capture guid for the span (0 is fine as 'unknown'):
guid, found := utils.GetUniqueIdentifier(ctx)
if found {
goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid)
}
// ...
tracing.RecordProviderAttributes(provSpan, guid, endpointAddress)There was a problem hiding this comment.
Commit 13bd410 addressed this comment by reintroducing the guard around WithUniqueIdentifier, so the goroutine context only inherits identifiers when GetUniqueIdentifier actually found a GUID and no longer blocks downstream GUID generation due to a stored zero value.
| } | ||
|
|
||
| ctx, processSpan := tracing.StartInternalSpan(ctx, tracing.SpanProcessingResult) | ||
| defer processSpan.End() |
There was a problem hiding this comment.
processSpan is named SpanProcessingResult but defer processSpan.End() at this scope keeps it open until SendParsedRelay returns — so the measured duration includes appendHeadersToRelayResult, analytics assignment, and LogRelay, not just ProcessingResult(). The span's timing no longer matches its name.
Suggested fix — scope the span to just the ProcessingResult call:
returnedResult, err := func() (*common.RelayResult, error) {
_, processSpan := tracing.StartInternalSpan(ctx, tracing.SpanProcessingResult)
defer processSpan.End()
r, e := relayProcessor.ProcessingResult()
if r != nil && r.Reply != nil {
tracing.RecordBody(processSpan, tracing.AttrRelayResponseBody, r.Reply.Data)
}
return r, e
}()There was a problem hiding this comment.
Commit 13bd410 addressed this comment by scoping processSpan to an inline function that only wraps relayProcessor.ProcessingResult, ensuring the span ends before the subsequent logging and analytics operations.
| cacheLatencyMs := float64(time.Since(cacheStart).Milliseconds()) | ||
| cacheHit := cacheError == nil && cacheReply != nil && cacheReply.GetReply() != nil | ||
| tracing.RecordCacheResult(ctx, cacheSpan, cacheHit, cacheLatencyMs) | ||
| cacheSpan.End() |
There was a problem hiding this comment.
Minor: cacheSpan.End() is called explicitly rather than deferred. If anything between line 1584 and here panics, the span leaks. Defensive nit — prefer defer cacheSpan.End() right after the span is started (ideally wrapping this whole block in a helper or IIFE so the defer fires on any exit).
There was a problem hiding this comment.
Commit 13bd410 addressed this comment by wrapping the cache lookup logic in a helper closure that defers cacheSpan.End(), ensuring the span always closes even if an interrupt occurs.
| // by the SDK's SpanLimits, which reads OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT | ||
| // (default: unlimited). Operators who want a body cap should set that env var. | ||
| func RecordBody(span trace.Span, attrKey string, body []byte) { | ||
| if traceBodyEnabled && span.IsRecording() { |
There was a problem hiding this comment.
Perf note: string(body) copies the full byte slice before SetAttributes applies the SDK's OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT truncation. For a large response body with a small configured limit, the full payload is still allocated only to be thrown away.
Suggested fix — read the env var once during New() and pre-truncate here:
var bodyAttrLimit int // set from OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT, 0 = unlimited
func RecordBody(span trace.Span, attrKey string, body []byte) {
if !traceBodyEnabled || !span.IsRecording() {
return
}
if bodyAttrLimit > 0 && len(body) > bodyAttrLimit {
body = body[:bodyAttrLimit]
}
span.SetAttributes(attribute.String(attrKey, string(body)))
}There was a problem hiding this comment.
Commit 13bd410 addressed this comment by snapshotting the SDK span attribute value limit during startup and pre-truncating the body slice in RecordBody before converting it to a string, preventing large allocations when OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT is low.
13bd410 to
c7b96d2
Compare
There was a problem hiding this comment.
Issue: If the inbound metadata carries both "traceparent" and "Traceparent" (case mismatch), only the first is updated; the second remains with whatever stale value it
had. The whole point of replace-in-place was to avoid duplicate traceparent downstream — this case slips through.
Suggestion: track which positions matched and either overwrite both with the new value or null out the duplicates. Or, since this is a defensive measure for an unlikely
upstream bug, just collapse case-folded duplicates before injecting.
There was a problem hiding this comment.
Commit 6f4077c addressed this comment by tracking every canonical header that was refreshed and updating any additional case-variant occurrences before re-appending the remaining carrier keys. The new seen map ensures both “traceparent” and “Traceparent” (and other duplicates) now receive the fresh value and are never re-added later, preventing stale duplicates.
There was a problem hiding this comment.
**Issue: **[]byte(req) allocates a fresh copy of the entire request string before RecordBody is even called. RecordBody's early-return on traceBodyEnabled happens after the
conversion has already cost an allocation + memcpy of the full body. For high-RPS deployments with non-trivial request sizes, this is a measurable hit purely from
instrumentation that the operator hasn't opted into.
tracing.IsTraceBodyEnabled() already exists for exactly this reason — it's documented as the "instrumentation-site check" — but no call site uses it.
Suggestion (option A — gate at call site):
if tracing.IsTraceBodyEnabled() {
tracing.RecordBody(span, tracing.AttrRelayRequestBody, []byte(req))
}
Suggestion (option B — preferred, also avoids the inner string(body) round-trip): add a string-typed variant in span_helpers.go:
func RecordBodyString(span trace.Span, attrKey, body string) {
if !traceBodyEnabled || !span.IsRecording() {
return
}
if bodyAttrLimit > 0 && len(body) > bodyAttrLimit {
body = body[:bodyAttrLimit]
}
span.SetAttributes(attribute.String(attrKey, body))
}
and use it here: tracing.RecordBodyString(span, tracing.AttrRelayRequestBody, req). Keeps the existing RecordBody([]byte) for callers (e.g. r.Reply.Data) that genuinely
have bytes.
There was a problem hiding this comment.
Commit 6f4077c addressed this comment by wrapping the relay request/response body instrumentation in tracing.IsTraceBodyEnabled so we never call RecordBody (or allocate []byte) when --otel-trace-body is off, and the span helper now defers the flag check to callers for the same reason.
…s, optimized tracing request/response body
6f4077c to
d52980d
Compare
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!in the type prefix if API or client breaking changemainbranchReviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...