feat(telemetry): comprehensive Langfuse tracing across all providers #1

Merged
mabry1985 merged 1 commit into main from feat/langfuse-tracing on Apr 2, 2026

@mabry1985

Summary

  • LLM spans on all 3 providers — OpenAI-compat, Anthropic, Gemini each emit gen_ai chat {model} spans with gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens for Langfuse cost dashboards
  • Turn hierarchy — per-turn root span (turn) in client.ts using OTel context propagation; all child spans (LLM, tool, agent) nest under it in Langfuse's trace view
  • Tool + subagent spans — every tool execution in coreToolScheduler and every subagent in agent.ts (foreground + background) wrapped in child spans
  • Content logging — prompt/response span events gated by telemetryLogPrompts (default on), truncated at 10k chars
  • 26 new tests — Langfuse activation, turn span lifecycle, default URL, auth header encoding
  • Docs — README Observability section + AGENTS.md config note

Set LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY and every session is fully traced. No other config needed.
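
The token-usage attributes and the content-logging gate described in the summary can be illustrated with a minimal sketch. The attribute keys and the 10k-character limit come from the summary above; the `TokenUsage` shape, the function names, and the choice to compute the total as input plus output are assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical shapes; the PR does not show its real types.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

type SpanAttributes = Record<string, string | number>;

// "truncated at 10k chars" per the PR summary.
const MAX_CONTENT_CHARS = 10000;

// Builds the three usage attributes named in the summary.
// Assumption: total is derived as input + output.
function usageAttributes(usage: TokenUsage): SpanAttributes {
  return {
    "gen_ai.usage.input_tokens": usage.inputTokens,
    "gen_ai.usage.output_tokens": usage.outputTokens,
    "gen_ai.usage.total_tokens": usage.inputTokens + usage.outputTokens,
  };
}

// Returns the text to attach as a span event, or undefined when
// prompt logging is disabled (the telemetryLogPrompts gate).
function contentForSpanEvent(
  text: string,
  logPrompts: boolean,
): string | undefined {
  if (!logPrompts) return undefined;
  return text.length > MAX_CONTENT_CHARS
    ? text.slice(0, MAX_CONTENT_CHARS)
    : text;
}
```

In a real provider these values would be set on the active span through the OTel API; the sketch keeps them as plain data so the gating and truncation logic is easy to test in isolation.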

Test plan

  • CI passes
  • Set Langfuse env vars locally, run proto, verify session → turn → LLM/tool spans appear in Langfuse
  • Verify telemetry.enabled: false still suppresses OTLP pipeline but Langfuse traces still flow

🤖 Generated with Claude Code

Add full OTel span instrumentation so setting LANGFUSE_PUBLIC_KEY +
LANGFUSE_SECRET_KEY gives end-to-end trace visibility in Langfuse.

- LLM spans: all 3 providers (OpenAI-compat, Anthropic, Gemini) emit
  gen_ai spans with token usage attrs (input/output/total) for cost tracking
- Turn hierarchy: per-turn root span in client.ts with context propagation
  so LLM/tool/agent spans nest correctly in Langfuse trace view
- Tool spans: coreToolScheduler wraps every tool execution in a child span
  with name, type, decision, duration, and error attributes
- Agent spans: agent.ts wraps subagent execution (foreground + background)
  with full lifecycle coverage
- Content logging: prompt/response span events gated by telemetryLogPrompts
- Tests: 26 new tests covering Langfuse activation and turn span lifecycle
- Docs: README Observability section + AGENTS.md config note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mabry1985 merged commit 6a61a19 into main on Apr 2, 2026
2 of 3 checks passed
mabry1985 pushed a commit that referenced this pull request Apr 7, 2026
…esis

Based on analysis of 35+ sources including JetBrains, Anthropic, SWE-bench
leaderboard, and Wink's 42k-trajectory production study.

## Changes

**Doom-loop fingerprinting (loopDetectionService + agent-core)**
Replace consecutive-only threshold=5 with a sliding 20-call window where
any fingerprint appearing 3+ times = doom loop. Catches non-consecutive
repetition patterns that the old approach missed (the #1 recovery category
in production, 39% of all interventions per Wink).
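
A minimal sketch of the sliding-window check, assuming a fingerprint built from the tool name and its JSON-serialized arguments. The window size of 20 and the threshold of 3 come from the description above; the class name, method name, and fingerprint format are hypothetical and not taken from `loopDetectionService`.

```typescript
const WINDOW_SIZE = 20; // sliding window of recent tool calls
const REPEAT_THRESHOLD = 3; // 3+ occurrences in the window = doom loop

// Hypothetical fingerprint: tool name plus serialized arguments.
// Note that JSON.stringify is sensitive to key order.
function fingerprint(toolName: string, args: unknown): string {
  return `${toolName}:${JSON.stringify(args)}`;
}

class DoomLoopDetector {
  private readonly recent: string[] = [];

  // Records a call and returns true when its fingerprint now appears
  // REPEAT_THRESHOLD or more times within the last WINDOW_SIZE calls.
  // Only the new fingerprint needs counting: every other count can
  // only have stayed the same or dropped.
  record(toolName: string, args: unknown): boolean {
    const fp = fingerprint(toolName, args);
    this.recent.push(fp);
    if (this.recent.length > WINDOW_SIZE) this.recent.shift();

    let count = 0;
    for (const seen of this.recent) {
      if (seen === fp) count++;
    }
    return count >= REPEAT_THRESHOLD;
  }
}
```

Unlike a consecutive-only counter, this flags a pattern such as read, edit, read, grep, read, where the repeated call is interleaved with others.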

**Silent sensors (baselineCheck + postEditVerify)**
Remove PASSED output from baseline verify — silent on pass preserves context
budget. Structured remediation steps on failure: read error → fix root cause
→ re-run command.

**Read-only plan subagent (builtin-agents)**
Add `plan` builtin agent with write tools structurally absent from its
schema. Prevents the "accidental edit during planning" failure mode that
every successful harness independently converged on.

**Checkpoint commits (agentCore + agent-core)**
Add `gitSnapshotBeforeEdit()` — creates a named shadow-repo commit before
every file-mutating tool call in the AgentCore path. Durable across crashes,
fire-and-forget so it never blocks tool execution. Pairs with existing
in-memory CheckpointStore for dual-layer rollback.

**Scope lock (scopeLock service + agent-core)**
New `ScopeLockService` singleton — activated from a sprint contract with a
permitted file set. Any write outside the set is intercepted before the tool
executes, returning a structured violation message and blocking the edit.
Addresses the 6.62% "unrequested changes" failure category.
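
The interception step could look roughly like the sketch below. It assumes the permitted set is a list of relative file paths and that a write is checked before the tool runs; the class name `ScopeLock`, its methods, and the message wording are hypothetical, and the sketch is a plain class rather than the singleton the commit describes.

```typescript
type ScopeCheck =
  | { allowed: true }
  | { allowed: false; path: string; message: string };

// Assumption: paths are compared after stripping a leading "./".
// A real implementation would also need to resolve ".." segments.
function normalizePath(path: string): string {
  return path.indexOf("./") === 0 ? path.slice(2) : path;
}

class ScopeLock {
  private permitted: Set<string> | null = null;

  // Activates the lock with the file set from a sprint contract.
  activate(permittedFiles: string[]): void {
    this.permitted = new Set(permittedFiles.map(normalizePath));
  }

  deactivate(): void {
    this.permitted = null;
  }

  // Called before a file-mutating tool executes.
  checkWrite(path: string): ScopeCheck {
    if (this.permitted === null) return { allowed: true }; // lock inactive
    const target = normalizePath(path);
    if (this.permitted.has(target)) return { allowed: true };
    return {
      allowed: false,
      path: target,
      message: `Scope violation: ${target} is outside the permitted file set.`,
    };
  }
}
```

Returning a structured result instead of throwing lets the caller hand the violation message back to the model as the tool result.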

**Observation masking (chatCompressionService + agent-core)**
Add `applyObservationMask()` — replaces old tool call/result pairs with a
placeholder, keeping the last N verbatim. Applied before LLM compression in
the AgentCore compaction path. JetBrains (2025): observation masking reduces
peak tokens 26-54% while LLM summarisation made agents run 15% longer.
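
As an illustration of the masking step, the sketch below replaces the arguments and results of all but the last N tool exchanges with a placeholder. The `HistoryEntry` shape, the function name `maskOldObservations`, and the placeholder text are assumptions; the actual `applyObservationMask()` signature is not shown in this commit message.

```typescript
// Hypothetical history shape: plain text turns and tool call/result pairs.
type HistoryEntry =
  | { kind: "text"; role: "user" | "assistant"; text: string }
  | { kind: "tool"; name: string; args: string; result: string };

const MASK_PLACEHOLDER = "[tool output omitted]";

// Returns a new history in which every tool exchange except the last
// `keepLast` has its args and result replaced by a placeholder.
// Text entries and the input array are left untouched.
function maskOldObservations(
  history: HistoryEntry[],
  keepLast: number,
): HistoryEntry[] {
  const toolIndexes: number[] = [];
  history.forEach((entry, i) => {
    if (entry.kind === "tool") toolIndexes.push(i);
  });

  const cutoff = Math.max(0, toolIndexes.length - keepLast);
  const toMask = new Set(toolIndexes.slice(0, cutoff));

  return history.map((entry, i) =>
    entry.kind === "tool" && toMask.has(i)
      ? { ...entry, args: MASK_PLACEHOLDER, result: MASK_PLACEHOLDER }
      : entry,
  );
}
```

Keeping the tool name while dropping the payload preserves the shape of the trajectory for the model at a fraction of the token cost.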

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mabry1985 added a commit that referenced this pull request May 1, 2026
… example shape to description (#177)

Reported failure mode: smaller / older models (Qwen3, Minimax variants,
some open-weights routes) JSON-encode the entire `questions` array as a
string instead of emitting it as a native array literal. validateToolParams
threw with "Parameter \"questions\" must be an array" — useless feedback
since the model HAD sent the array, just stringified.

Three changes, layered:

1. Silent coercion in validateToolParams. If `questions` is a string that
   parses as JSON, parse it and continue. Logs a debugLogger.warn so the
   signal stays visible — silent coercion would mask a real upstream
   regression if model behavior shifts. Catches ~all of the user-reported
   failures without a retry round-trip.

2. Example shape added to the tool description. Models replicate concrete
   examples better than they synthesize from abstract schemas with 3
   levels of nesting. Placeholder text is clearly labeled as shape-only
   so models don't cargo-cult the example values into their actual
   questions.

3. Sharper error message for the residual case (non-JSON garbage in the
   string slot): "Pass `questions` as a real array literal, not a
   JSON-encoded string." Clear, specific, tells the model exactly which
   strategy to drop.
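
Changes 1 and 3 can be sketched together as a single coercion helper. The two error strings are quoted from this commit message; the function name `coerceQuestions`, the result type, and the `warn` callback standing in for `debugLogger.warn` are hypothetical. The sketch also assumes that a string which parses to something other than an array is treated as the residual error case.

```typescript
type CoerceResult =
  | { ok: true; questions: unknown[]; coerced: boolean }
  | { ok: false; error: string };

function coerceQuestions(
  raw: unknown,
  warn: (message: string) => void = () => {},
): CoerceResult {
  // Native array: nothing to do.
  if (Array.isArray(raw)) {
    return { ok: true, questions: raw, coerced: false };
  }

  if (typeof raw === "string") {
    try {
      const parsed: unknown = JSON.parse(raw);
      if (Array.isArray(parsed)) {
        // Keep the signal visible so a shift in model behavior is noticed.
        warn("`questions` arrived as a JSON-encoded string; coerced to an array");
        return { ok: true, questions: parsed, coerced: true };
      }
    } catch {
      // Not valid JSON; fall through to the sharper error below.
    }
    return {
      ok: false,
      error:
        "Pass `questions` as a real array literal, not a JSON-encoded string.",
    };
  }

  return { ok: false, error: 'Parameter "questions" must be an array' };
}
```

Coercing at validation time avoids a retry round-trip for the common case, while the string-specific error tells the model what to change when coercion is not possible.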

Considered but rejected:
- Schema relaxation (allow `options: string[]`, default multiSelect):
  API change, breaks downstream `option.description` consumers (dialog
  UI, ACP renderer), premature without data showing #1+#2 are
  insufficient.

Tests updated with two new cases: stringified-array coercion, non-JSON
string error path. 5319 core / 0 fail; lint + typecheck clean.

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>