Goal
Coordinate on Langfuse tracing between protoCLI (primary) and the LiteLLM gateway (homelab-iac) so we get a single, complete trace per turn that includes:
- protoCLI's existing structure (turn → agent.execute → tool calls → gen_ai chat)
- Model thinking traces (currently lost on streaming requests)
- Gateway-side processing decisions (normalizer firing, salvage paths, retries, fallbacks)
- Upstream model usage/timing as visible to the gateway
Today, only the first bullet is captured. This issue is to align on which sources write what and how they merge in Langfuse.
What's currently happening
protoCLI (this repo, packages/core/src/telemetry/sdk.ts):
- Initializes OTel SDK with HttpInstrumentation and a Langfuse-specific BatchSpanProcessor that exports to ${LANGFUSE_BASE_URL}/api/public/otel/v1/traces
- service.name: qwen-code on all spans
- gen_ai chat ${model} span emitted from packages/core/src/core/openaiContentGenerator/pipeline.ts:569 with attributes for system, model, usage, plus a gen_ai.content.completion event
- W3C traceparent header is auto-propagated on outbound HTTP via HttpInstrumentation (so the gateway could read it)
- pipeline.ts already extracts delta.reasoning_content and delta.reasoning from chunks (lines 156-160) but only for debug logging; they are not added to the span
LiteLLM gateway (protoLabsAI/homelab-iac, stacks/ai/config/litellm/callbacks/thinking_normalizer.py):
- Strips <think>...</think> markup from streaming delta.content before forwarding to clients (per protoLabsAI/homelab-iac#26)
- Captures the stripped thinking text and writes it to request_data["metadata"]["thinking"][<choice_idx>] (per protoLabsAI/homelab-iac#28)
- Has success_callback: ["langfuse", "prometheus"] configured
- But: I audited the 500 most recent generations in Langfuse, and 100% have service.name: qwen-code (protoCLI). Zero come from LiteLLM's langfuse callback. The gateway-written metadata.thinking never reaches any visible trace.
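For readers unfamiliar with the normalizer, here is a minimal sketch of the kind of streaming strip-and-capture described above. It is not the code in thinking_normalizer.py; the class name ThinkStripper and its methods are made up for illustration, and treating an unclosed <think> block as thinking text at end of stream is an assumption about the salvage behavior.

```python
class ThinkStripper:
    """Hypothetical helper: splits streamed text into visible content
    and the text found between <think> and </think>."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.inside = False   # True while between <think> and </think>
        self.pending = ""     # held-back tail that may be the start of a tag
        self.thinking = []    # captured thinking fragments

    def feed(self, chunk):
        """Return the visible part of chunk; thinking text is captured."""
        buf = self.pending + chunk
        self.pending = ""
        visible = []
        while buf:
            tag = self.CLOSE if self.inside else self.OPEN
            sink = self.thinking if self.inside else visible
            idx = buf.find(tag)
            if idx != -1:
                # Full tag found: emit what precedes it, then flip state.
                sink.append(buf[:idx])
                buf = buf[idx + len(tag):]
                self.inside = not self.inside
                continue
            # No full tag: hold back a tail that could be a tag prefix,
            # because the tag may be split across two chunks.
            keep = 0
            for n in range(min(len(tag) - 1, len(buf)), 0, -1):
                if tag.startswith(buf[-n:]):
                    keep = n
                    break
            sink.append(buf[:len(buf) - keep])
            self.pending = buf[len(buf) - keep:]
            buf = ""
        return "".join(visible)

    def finish(self):
        """Flush at end of stream; returns (leftover visible text, thinking).
        Assumption: an unclosed <think> block is kept as thinking text."""
        tail, self.pending = self.pending, ""
        if self.inside:
            self.thinking.append(tail)
            tail = ""
        return tail, "".join(self.thinking)


stripper = ThinkStripper()
chunks = ["Hello <th", "ink>secret", " plan</thi", "nk> world"]
visible = "".join(stripper.feed(c) for c in chunks)
tail, thinking = stripper.finish()
print(repr(visible + tail), repr(thinking))
```

The point for this issue is the second return value of finish(): that captured text is what currently lands in metadata.thinking and never reaches the protoCLI-written trace.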
Why this is happening
The PR that added metadata.thinking was verified against a direct curl probe where LiteLLM's langfuse callback was the only writer — and it worked. But for actual protoCLI traffic, protoCLI's OTel SDK writes to Langfuse directly, bypassing whatever LiteLLM's callback does. The two paths don't merge.
What full coverage would look like
A single Langfuse trace per protoCLI turn containing:
turn (protoCLI)
├── agent.execute
│ ├── tool/<name>
│ ├── tool/<name>
│ └── gen_ai chat protolabs/smart ← currently captured by protoCLI
│ ├── attributes:
│ │ ├── gen_ai.system, gen_ai.request.model, ... [present]
│ │ ├── gen_ai.usage.{input,output,total}_tokens [present]
│ │ └── gen_ai.response.thinking (e.g., 384 chars) ← MISSING
│ ├── child span: gateway.normalize ← MISSING
│ │ ├── normalizer.thinking_captured (bool)
│ │ ├── normalizer.unclosed_think_salvaged (bool)
│ │ └── normalizer.duration_ms
│ └── child span: gateway.upstream.vllm ← MISSING
│ └── upstream.duration_ms, upstream.first_token_ms
Three architecture options
Option A: Gateway returns thinking via response, protoCLI captures it as span attribute
- Gateway adds a custom HTTP response header (e.g., x-llm-thinking) or appends a final SSE chunk with structured metadata
- pipeline.ts's stream handler already has the stripped delta.reasoning_content / delta.reasoning available; extend it to also read the gateway-emitted thinking and set gen_ai.response.thinking as a span attribute
- Single source (protoCLI), single trace, no merge needed
Pros: simplest from the trace-merging perspective; one writer.
Cons: custom response shape; loses gateway-internal observability (normalizer decisions, upstream timing).
Option B: Gateway emits its own OTel spans to the same Langfuse endpoint, joined via traceparent
- protoCLI already auto-propagates traceparent via HttpInstrumentation
- LiteLLM gateway adds OTel SDK, reads the incoming traceparent, and emits child spans under the same trace_id to the Langfuse OTel endpoint
- Spans: gateway.normalize (with thinking as attribute), gateway.upstream (with vLLM timing), etc.
- Both protoCLI and gateway write to Langfuse; Langfuse merges by trace_id
Pros: clean parent-child structure; both sides own their data; no custom response shape.
Cons: gateway needs OTel instrumentation added (more code); needs LiteLLM's langfuse callback disabled to avoid double-writing.
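To make the join in Option B concrete, here is a rough stdlib-only sketch of what "reads incoming traceparent" amounts to on the gateway. The names ParentContext, parse_traceparent, and new_child_span are hypothetical, and the sketch only accepts the version-00 header layout from the W3C Trace Context spec. A real implementation would more likely lean on the OpenTelemetry Python propagators than hand-roll this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-00 layout: version "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class ParentContext(NamedTuple):
    trace_id: str        # 32 hex chars, shared by every span in the trace
    parent_span_id: str  # 16 hex chars, the caller's span (protoCLI side)
    sampled: bool        # bit 0 of trace-flags


def parse_traceparent(header: Optional[str]) -> Optional[ParentContext]:
    """Parse a W3C traceparent header; return None if absent or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Simplification: only version 00 is accepted in this sketch.
    if version != "00":
        return None
    # All-zero trace-id or parent-id is invalid per the spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def new_child_span(parent: ParentContext, name: str) -> dict:
    """Describe a gateway span that joins the caller's trace.
    The dict shape is illustrative, not an OTel or Langfuse type."""
    return {
        "name": name,
        "trace_id": parent.trace_id,              # same trace as protoCLI
        "parent_span_id": parent.parent_span_id,  # nests under the caller
        "span_id": secrets.token_hex(8),          # fresh 16-hex span id
    }


ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
if ctx is not None:
    print(new_child_span(ctx, "gateway.normalize"))
```

If the header is missing or malformed, the gateway would have to start its own trace, and those spans would not merge with protoCLI's. That is why the propagation check on the protoCLI side matters.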
Option C: Status quo + small protoCLI capture of available reasoning fields
- Gateway keeps stripping markup but also passes through a small reasoning_content field on the final chunk (via the OpenAI-extension reasoning field convention)
- pipeline.ts is updated to add a gen_ai.response.reasoning span attribute when present
- Doesn't capture gateway-internal decisions
Pros: smallest change, ships fast.
Cons: still single-source; loses gateway internals; works around current architecture rather than fixing it.
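The client-side capture in Option C is small. The real change would live in pipeline.ts (TypeScript); the Python below only illustrates the logic, written against plain dicts shaped like OpenAI-style streaming chunks. The function name collect_reasoning is hypothetical, and keying by choice index is an assumption that mirrors the gateway's metadata["thinking"][<choice_idx>] layout.

```python
from typing import Any, Dict, Iterable, List, Mapping


def collect_reasoning(chunks: Iterable[Mapping[str, Any]]) -> Dict[int, str]:
    """Concatenate reasoning deltas per choice index.

    Reads delta.reasoning_content first and falls back to delta.reasoning,
    the two fields pipeline.ts already extracts for debug logging.
    """
    parts: Dict[int, List[str]] = {}
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            text = delta.get("reasoning_content") or delta.get("reasoning")
            if isinstance(text, str) and text:
                parts.setdefault(choice.get("index", 0), []).append(text)
    return {index: "".join(pieces) for index, pieces in parts.items()}


stream = [
    {"choices": [{"index": 0, "delta": {"reasoning_content": "Check "}}]},
    {"choices": [{"index": 0, "delta": {"reasoning": "the cache."}}]},
    {"choices": [{"index": 0, "delta": {"content": "Done."}}]},
]
reasoning = collect_reasoning(stream)
# The joined text would then be set once on the gen_ai chat span, under
# whichever attribute name is agreed (e.g. gen_ai.response.reasoning).
print(reasoning)
```

Setting the attribute once at end of stream, rather than per chunk, keeps the span write cheap and avoids partial values if the stream is cut short.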
Proposed direction
Recommend Option B for completeness, with Option C as a stepping stone if Option B is too much scope.
Coordination needed:
| Side | Change |
| --- | --- |
| protoCLI | Verify W3C traceparent propagation reaches the gateway (should be automatic via HttpInstrumentation; confirm with a debug request). Add span attributes for thinking when present (Option C step). |
| homelab-iac (gateway) | Add OTel SDK to LiteLLM container, configure to export to same Langfuse OTel endpoint (Option B). Replace current metadata.thinking write with a gateway span attribute. Disable LiteLLM's langfuse callback to avoid duplicate writes. |
| Both | Decide canonical attribute names: gen_ai.response.thinking? llm.reasoning_content? Pick one and document. |
Happy to drive the gateway side. Looking for a protoCLI maintainer to:
- Confirm/document the current OTel setup (which fields are captured, sample trace, what's the canonical attribute schema)
- Discuss whether direction B is right or if there's a different roadmap protoCLI is already pursuing
- Pick attribute names so we don't drift
Related context
- metadata.thinking write (the one not surfacing for protoCLI traffic)