tracing: coordinate Langfuse trace coverage across protoCLI + gateway (thinking, normalizer, upstream) #162

## Goal

Coordinate on Langfuse tracing between protoCLI (primary) and the LiteLLM gateway (homelab-iac) so we get a single, complete trace per turn that includes:

- protoCLI's existing structure (turn → agent.execute → tool calls → gen_ai chat)
- Model thinking traces (currently lost on streaming requests)
- Gateway-side processing decisions (normalizer firing, salvage paths, retries, fallbacks)
- Upstream model usage/timing as visible to the gateway

Today, only the first bullet is captured. This issue is meant to align on which source writes which data and how those writes merge in Langfuse.

## What's currently happening

**protoCLI** (this repo, `packages/core/src/telemetry/sdk.ts`):
- Initializes OTel SDK with `HttpInstrumentation` and a Langfuse-specific `BatchSpanProcessor` that exports to `${LANGFUSE_BASE_URL}/api/public/otel/v1/traces`
- `service.name: qwen-code` on all spans
- `gen_ai chat ${model}` span emitted from `packages/core/src/core/openaiContentGenerator/pipeline.ts:569` with attributes for system, model, usage, plus a `gen_ai.content.completion` event
- W3C `traceparent` header is auto-propagated on outbound HTTP via HttpInstrumentation (so gateway *could* read it)
- `pipeline.ts` already extracts `delta.reasoning_content` and `delta.reasoning` from chunks (lines 156-160), but only for debug logging; neither is added to the span
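
To make the last bullet concrete, here is a minimal sketch of what keeping the reasoning text (instead of only logging it) could look like. It is written in Python purely for illustration; the real change would live in `pipeline.ts`. The chunk shape assumes OpenAI-style streaming deltas, `collect_reasoning` is a hypothetical helper, and the attribute name is one of the undecided candidates rather than an agreed schema.

```python
from typing import Any, Dict, Iterable, List


def collect_reasoning(chunks: Iterable[Dict[str, Any]]) -> Dict[int, str]:
    """Concatenate per-choice reasoning text from OpenAI-style stream chunks."""
    parts: Dict[int, List[str]] = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # Either field may carry reasoning text, depending on the upstream.
            text = delta.get("reasoning_content") or delta.get("reasoning")
            if text:
                parts.setdefault(choice.get("index", 0), []).append(text)
    return {idx: "".join(pieces) for idx, pieces in parts.items()}


# Illustrative chunks, not captured from real traffic.
chunks = [
    {"choices": [{"index": 0, "delta": {"reasoning_content": "Check the "}}]},
    {"choices": [{"index": 0, "delta": {"reasoning": "cache first."}}]},
    {"choices": [{"index": 0, "delta": {"content": "Done."}}]},
]
reasoning = collect_reasoning(chunks)

# Hypothetical attribute name; the canonical name is still to be decided.
span_attributes = {"gen_ai.response.thinking": reasoning.get(0, "")}
```

The resulting dictionary is what would be set on the `gen_ai chat` span once the stream ends.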

**LiteLLM gateway** (`protoLabsAI/homelab-iac`, `stacks/ai/config/litellm/callbacks/thinking_normalizer.py`):
- Strips `<think>...</think>` markup from streaming `delta.content` before forwarding to clients (per protoLabsAI/homelab-iac#26)
- Captures the stripped thinking text and writes to `request_data["metadata"]["thinking"][<choice_idx>]` (per protoLabsAI/homelab-iac#28)
- Has `success_callback: ["langfuse", "prometheus"]` configured
- **But:** I audited the 500 most recent generations in Langfuse: 100% have `service.name: qwen-code` (protoCLI). Zero come from LiteLLM's langfuse callback. The gateway-written `metadata.thinking` never reaches any visible trace.
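
For readers who have not seen the normalizer, the strip-and-capture behaviour described above can be sketched roughly as follows. This is not the code in `thinking_normalizer.py`; `ThinkStripper` is a hypothetical stand-in that handles tags split across chunk boundaries, and the dict-keyed `metadata["thinking"]` layout is an assumption about how `<choice_idx>` is stored.

```python
from typing import List

OPEN, CLOSE = "<think>", "</think>"


class ThinkStripper:
    """Removes <think>...</think> from a text stream and keeps what was removed."""

    def __init__(self) -> None:
        self.buffer = ""          # unclassified text, may end in a partial tag
        self.in_think = False     # True while inside an open <think> block
        self.thinking: List[str] = []

    def feed(self, text: str) -> str:
        """Consume one delta and return the part that is safe to forward."""
        self.buffer += text
        visible: List[str] = []
        while True:
            tag = CLOSE if self.in_think else OPEN
            sink = self.thinking if self.in_think else visible
            pos = self.buffer.find(tag)
            if pos >= 0:
                sink.append(self.buffer[:pos])
                self.buffer = self.buffer[pos + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back a suffix that could be the tag's start.
            keep = 0
            for n in range(min(len(tag) - 1, len(self.buffer)), 0, -1):
                if self.buffer.endswith(tag[:n]):
                    keep = n
                    break
            cut = len(self.buffer) - keep
            sink.append(self.buffer[:cut])
            self.buffer = self.buffer[cut:]
            return "".join(visible)

    def flush(self) -> str:
        """End of stream: leftover text is thinking if the block never closed."""
        rest, self.buffer = self.buffer, ""
        if self.in_think:
            self.thinking.append(rest)
            return ""
        return rest


stripper = ThinkStripper()
pieces = ["Hi <thi", "nk>plan</th", "ink> answer"]
visible = "".join(stripper.feed(piece) for piece in pieces) + stripper.flush()

# Assumed layout of the metadata write, keyed by choice index.
metadata = {"thinking": {0: "".join(stripper.thinking)}}
```

If the stream ends while `in_think` is still true, the captured text corresponds to the unclosed-think salvage case mentioned in the Goal section.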

## Why this is happening

The PR that added `metadata.thinking` was verified against a **direct curl probe** where LiteLLM's langfuse callback was the only writer — and it worked. But for actual protoCLI traffic, protoCLI's OTel SDK writes to Langfuse directly, bypassing whatever LiteLLM's callback does. The two paths don't merge.

## What full coverage would look like

A single Langfuse trace per protoCLI turn containing:

```
turn (protoCLI)
├── agent.execute
│   ├── tool/<name>
│   ├── tool/<name>
│   └── gen_ai chat protolabs/smart   ← currently captured by protoCLI
│       ├── attributes:
│       │   ├── gen_ai.system, gen_ai.request.model, ...     [present]
│       │   ├── gen_ai.usage.{input,output,total}_tokens     [present]
│       │   └── gen_ai.response.thinking (e.g., 384 chars)    ← MISSING
│       ├── child span: gateway.normalize                    ← MISSING
│       │   ├── normalizer.thinking_captured (bool)
│       │   ├── normalizer.unclosed_think_salvaged (bool)
│       │   └── normalizer.duration_ms
│       └── child span: gateway.upstream.vllm                ← MISSING
│           └── upstream.duration_ms, upstream.first_token_ms
```
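
As a rough illustration of where the `gateway.normalize` attributes in the tree could come from, the sketch below wraps a normalizer call and derives the three values. `timed_normalize` and `toy_normalize` are hypothetical names, and the toy normalizer only handles a single think block; the attribute keys are the ones proposed in the tree above.

```python
import time
from typing import Callable, Dict, Tuple

# A normalizer returns (visible_text, thinking_text, salvaged_unclosed_block).
Normalizer = Callable[[str], Tuple[str, str, bool]]


def toy_normalize(raw: str) -> Tuple[str, str, bool]:
    """Stand-in normalizer: strips one <think> block, closed or not."""
    head, opened, rest = raw.partition("<think>")
    if not opened:
        return raw, "", False
    thinking, closed, tail = rest.partition("</think>")
    return head + tail, thinking, not closed


def timed_normalize(normalize: Normalizer, raw: str) -> Tuple[str, Dict[str, object]]:
    """Run the normalizer and build the attributes for a gateway.normalize span."""
    start = time.perf_counter()
    visible, thinking, salvaged = normalize(raw)
    duration_ms = (time.perf_counter() - start) * 1000.0
    attributes: Dict[str, object] = {
        "normalizer.thinking_captured": bool(thinking),
        "normalizer.unclosed_think_salvaged": salvaged,
        "normalizer.duration_ms": round(duration_ms, 3),
    }
    return visible, attributes


visible_text, normalize_attrs = timed_normalize(toy_normalize, "<think>plan</think>ok")
```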

## Three architecture options

### Option A: Gateway returns thinking via response, protoCLI captures it as span attribute

- Gateway adds a custom HTTP response header (e.g., `x-llm-thinking`) or appends a final SSE chunk with structured metadata
- `pipeline.ts`'s stream handler already extracts `delta.reasoning_content` / `delta.reasoning` from chunks; extend it to also read the gateway-emitted thinking and set `gen_ai.response.thinking` as a span attribute
- Single source (protoCLI), single trace, no merge needed

**Pros:** simplest from the trace-merging perspective; one writer.
**Cons:** custom response shape; loses gateway-internal observability (normalizer decisions, upstream timing).
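
One practical note on the two variants: response headers are sent before a streamed body, so for streaming requests the final-SSE-chunk variant looks more workable than a header. A minimal sketch of the client side, assuming the gateway appends one extra `data:` event carrying a hypothetical `x_gateway` object (that key and its shape are invented here for illustration):

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple


def split_gateway_metadata(
    sse_lines: Iterable[str],
) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Separate ordinary completion chunks from the gateway metadata event."""
    chunks: List[Dict[str, Any]] = []
    thinking: Optional[str] = None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        meta = chunk.get("x_gateway")  # hypothetical metadata key
        if meta is not None:
            thinking = meta.get("thinking")
            continue
        chunks.append(chunk)
    return chunks, thinking


# Illustrative stream, not captured from the real gateway.
lines = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    "",
    'data: {"x_gateway": {"thinking": "plan the reply"}}',
    "",
    "data: [DONE]",
]
content_chunks, gateway_thinking = split_gateway_metadata(lines)
```

Clients that do not know about the extra event would need to tolerate a chunk without `choices`, which is part of the "custom response shape" cost.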

### Option B: Gateway emits its own OTel spans to the same Langfuse endpoint, joined via traceparent

- protoCLI already auto-propagates `traceparent` via HttpInstrumentation
- LiteLLM gateway adds OTel SDK + reads incoming `traceparent`, emits child spans under the same trace_id to Langfuse OTel endpoint
- Spans: `gateway.normalize` (with thinking as attribute), `gateway.upstream` (with vLLM timing), etc.
- Both protoCLI and gateway write to Langfuse — Langfuse merges by trace_id

**Pros:** clean parent-child structure; both sides own their data; no custom response shape.
**Cons:** gateway needs OTel instrumentation added (more code); needs LiteLLM's langfuse callback disabled to avoid double-writing.
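
The join itself hinges on the `traceparent` header. In practice the OpenTelemetry SDK's W3C propagator would handle extraction; the stdlib-only sketch below just shows what the gateway needs from the header to emit a child span under protoCLI's trace. It only handles the version `00` layout, and `SpanContext` and `child_from_traceparent` are hypothetical names.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, lowercase hex, as defined by W3C Trace Context.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool


def child_from_traceparent(header: str) -> Optional[SpanContext]:
    """Return a context for a new gateway span parented to the caller's span."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero ids as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return SpanContext(
        trace_id=trace_id,                 # same trace as the protoCLI turn
        span_id=secrets.token_hex(8),      # fresh 16-hex id for the gateway span
        parent_span_id=parent_id,          # protoCLI's outbound HTTP span
        sampled=bool(int(flags, 16) & 0x01),
    )


ctx = child_from_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

A span exported with this `trace_id` and `parent_span_id` shares the trace that protoCLI is already writing to, which is what lets the two writers land in one tree.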

### Option C: Status quo + small protoCLI capture of available reasoning fields

- Gateway keeps stripping markup but also passes the captured thinking through on the final chunk in a small `reasoning_content` field (following the OpenAI-extension `reasoning` field convention)
- `pipeline.ts` is updated to add a `gen_ai.response.reasoning` span attribute when the field is present (final attribute name pending the naming decision under Proposed direction)
- Doesn't capture gateway-internal decisions

**Pros:** smallest change, ships fast.
**Cons:** still single-source; loses gateway internals; works around current architecture rather than fixing it.
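
On the gateway side, Option C amounts to attaching the captured text to the last chunk before it is forwarded. A minimal sketch, assuming OpenAI-style chunk dictionaries and a per-choice mapping of captured thinking; `attach_reasoning` is a hypothetical helper, not an existing LiteLLM hook:

```python
import copy
from typing import Any, Dict


def attach_reasoning(
    final_chunk: Dict[str, Any], thinking_by_choice: Dict[int, str]
) -> Dict[str, Any]:
    """Return a copy of the final chunk with reasoning_content added per choice."""
    out = copy.deepcopy(final_chunk)  # leave the caller's chunk untouched
    for choice in out.get("choices", []):
        text = thinking_by_choice.get(choice.get("index", 0))
        if text:
            choice.setdefault("delta", {})["reasoning_content"] = text
    return out


# Illustrative final chunk, not captured from real traffic.
final_chunk = {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
patched = attach_reasoning(final_chunk, {0: "plan the reply"})
```

Because the field only appears on the final chunk, intermediate deltas stay unchanged for existing clients.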

## Proposed direction

Recommend **Option B** for completeness, with **Option C as a stepping stone** if Option B is too much scope.

Coordination needed:

| Side | Change |
|---|---|
| protoCLI | Verify W3C traceparent propagation reaches the gateway (should be automatic via HttpInstrumentation; confirm with a debug request). Add span attributes for thinking when present (Option C step). |
| homelab-iac (gateway) | Add OTel SDK to LiteLLM container, configure to export to same Langfuse OTel endpoint (Option B). Replace current `metadata.thinking` write with a gateway span attribute. Disable LiteLLM's langfuse callback to avoid duplicate writes. |
| Both | Decide canonical attribute names: `gen_ai.response.thinking`? `llm.reasoning_content`? Pick one and document. |

Happy to drive the gateway side. Looking for a protoCLI maintainer to:

1. Confirm/document the current OTel setup (which fields are captured, sample trace, what's the canonical attribute schema)
2. Discuss whether Option B is the right direction or whether protoCLI is already pursuing a different roadmap
3. Pick attribute names so we don't drift

## Related context

- protoLabsAI/homelab-iac#24 — original streaming salvage hook
- protoLabsAI/homelab-iac#26 — single-channel architecture, drop reasoning_content channel
- protoLabsAI/homelab-iac#28 / #29 — `metadata.thinking` write (the one not surfacing for protoCLI traffic)
- vllm-project/vllm#40816 — upstream parser bug we routed around

