fix(langfuse): accumulate usage across hook invocations for traces (#42306) #42327
liuhao1024 wants to merge 1 commit
…ousResearch#42306)

When `post_api_request` fires per API call with usage data but `post_llm_call` (from `turn_finalizer.py`) fires without `response` or `usage` kwargs, the handler fell through to empty `usage_details` and `cost_details`. This left Langfuse GENERATION spans without token counts or cost information.

Fix: accumulate usage/cost in `TraceState` from each `post_api_request` invocation. When a subsequent `post_llm_call` fires without usage data, use the accumulated values as fallback. Also attach accumulated totals to the root trace in `_finish_trace` so the Langfuse dashboard shows usage at the trace level.
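The accumulate-then-fall-back idea in this commit message can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the `TraceState` field names come from the PR description, while the helper `accumulate`, the two-function split, and the key names (`input`, `output`, `total`) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TraceState:
    # Hypothetical stand-in for the plugin's per-trace state.
    accumulated_usage_details: Dict[str, int] = field(default_factory=dict)
    accumulated_cost_details: Dict[str, float] = field(default_factory=dict)


def accumulate(totals: dict, delta: Optional[dict]) -> None:
    """Add each numeric entry of delta into totals, in place."""
    for key, value in (delta or {}).items():
        totals[key] = totals.get(key, 0) + value


def on_post_api_request(state: TraceState, usage=None, cost=None) -> None:
    """Per-API-call hook: fold this call's usage and cost into the trace state."""
    accumulate(state.accumulated_usage_details, usage)
    accumulate(state.accumulated_cost_details, cost)


def on_post_llm_call(state: TraceState, usage=None, cost=None) -> Tuple[dict, dict]:
    """End-of-turn hook: prefer explicit values, else fall back to the totals."""
    usage_details = dict(usage) if usage else dict(state.accumulated_usage_details)
    cost_details = dict(cost) if cost else dict(state.accumulated_cost_details)
    return usage_details, cost_details


state = TraceState()
on_post_api_request(state, usage={"input": 100, "output": 20}, cost={"total": 0.5})
on_post_api_request(state, usage={"input": 50, "output": 10}, cost={"total": 0.25})

# post_llm_call arrives without usage kwargs, so the accumulated totals are used.
usage_details, cost_details = on_post_llm_call(state)
```

In this sketch `usage_details` ends up as the sum over both API calls. Whether that fallback path is ever reached in the real plugin depends on the hook plumbing discussed in the review below.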
Thanks for digging into this one. I tried to verify the fix end-to-end against the actual hook plumbing and I don't think this change fixes #42306; I believe the root cause is upstream of where this patch acts. Sharing my findings in case they're useful:

The premise is half-right. The real root cause is the response gate, which this PR doesn't touch.

I reproduced #42306 exactly on this branch, feeding the precise payload …

Because the extraction itself returns empty for the affected (sanitized-dict) case, the accumulated total attached to the root trace also accumulates zeros for those users. Tests pass but validate an unreachable state.

#40560 looks like the correct fix: it gates on …

Happy to be wrong here. If you're seeing usage attach in a setup where …
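To illustrate the class of failure described here, the sketch below shows how summing the output of an extractor that returns empty still yields nothing. Everything in it is hypothetical: `extract_usage_attr_only`, the attribute names, and the payload shapes are invented for illustration and are not taken from the plugin or from #40560.

```python
from types import SimpleNamespace


def extract_usage_attr_only(response) -> dict:
    """Hypothetical extractor that only understands attribute-style responses."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return {}
    return {
        "input": getattr(usage, "prompt_tokens", 0),
        "output": getattr(usage, "completion_tokens", 0),
    }


def accumulate(totals: dict, delta: dict) -> None:
    """Add each numeric entry of delta into totals, in place."""
    for key, value in delta.items():
        totals[key] = totals.get(key, 0) + value


# Assumed shapes: an SDK-style object versus a plain (sanitized) dict.
object_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=100, completion_tokens=20)
)
dict_response = {"usage": {"prompt_tokens": 100, "completion_tokens": 20}}

totals: dict = {}
# A dict has no `usage` attribute, so extraction returns {} and nothing is added.
accumulate(totals, extract_usage_attr_only(dict_response))
```

Under these assumptions, accumulation downstream of the extractor cannot recover token counts for the dict-shaped payload, which is why a guard at the extraction or gating step is the more direct fix.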
Thanks for the thorough analysis @kshitijk4poor. You're right, and I appreciate you tracing the actual hook ordering. You've identified the real issue: …

I've confirmed that #40560 by @kamonspecial addresses the correct root cause with a one-line guard on the sanitized-dict path. Closing this PR in favor of #40560.
AGENTS.md was almost entirely how-to/mechanics with the want/don't-want
guidance implicit and scattered. Adds a single authoritative intent layer
near the top, calibrated against what actually merges and what actually
gets rejected.
- 'What Hermes Is': framing + the two properties that drive design
(prompt-cache integrity, narrow-waist core).
- 'Contribution Rubric': dual-purpose intent doc — (1) for humans/own work:
what gets merged vs rejected; (2) for the triage sweeper: when a PR is safe
to close on the three allowed reasons AND when NOT to close one. Taste-based
'won't implement / out of scope' closes stay human-only by design.
- 'What we want' calibrated against the last ~55 merges: fix real bugs well,
expand reach at the edges (platforms/channels/providers/models/desktop —
large features land routinely), refactor god-files into clean modules,
keep the CORE narrow. 'Expansive at the edges, conservative at the waist.'
- 'What we don't want': speculative hooks, .env-for-non-secrets, needless
core tools, lazy-read escape hatches, feature-destroying fixes, ungated
telemetry, change-detector tests, core-touching plugins.
- 'Before you call it a bug — verify the premise (and when NOT to close)':
distilled from real closes (#41741 intentional-design-not-a-gap, #41610
wrong-premise, #42327 fix-never-executes, #42393 deliberate-omission,
#41999 overreach). Doubles as sweeper guidance to avoid wrongly closing
legitimate PRs.
- 'The Footprint Ladder' (core-tool decision): extend > CLI+skill > gated tool
> plugin > MCP server in the catalog > new core tool (last resort).
Trim: 'Adding New Tools' intro points at the ladder. Detailed mechanics stay
where readers need them.
What does this PR do?
Accumulates usage/cost data in the langfuse plugin's `TraceState` so that when `post_llm_call` fires without `response` or `usage` kwargs (as happens via `turn_finalizer.py`), the handler can fall back to values already collected from earlier `post_api_request` invocations. Also attaches accumulated usage totals to the root trace in `_finish_trace`.
Related Issue
Fixes #42306
Type of Change
Changes Made
- `plugins/observability/langfuse/__init__.py`: Added `accumulated_usage_details` and `accumulated_cost_details` fields to `TraceState`. Modified `on_post_llm_call` to accumulate usage/cost from each invocation and use accumulated values as fallback when `post_llm_call` fires without `response`/`usage`. Modified `_finish_trace` to attach accumulated usage/cost to the root trace span.
- `tests/plugins/test_langfuse_usage_accumulation.py`: Added 6 tests covering accumulation across API calls, fallback to accumulated values, empty fallback safety, and root-trace usage attachment.
How to Test
- `hermes plugins enable observability/langfuse`
- `~/.hermes/.env` (`HERMES_LANGFUSE_PUBLIC_KEY`, `HERMES_LANGFUSE_SECRET_KEY`)
- `hermes -p <profile> -m "test"`
- `pytest tests/plugins/test_langfuse_plugin.py tests/plugins/test_langfuse_usage_accumulation.py -v`: all 43 tests should pass
Checklist
Code
- … `fix(scope):`, `feat(scope):`, etc.)
- … `pytest tests/plugins/test_langfuse_plugin.py tests/plugins/test_langfuse_usage_accumulation.py -v` and all tests pass
Documentation & Housekeeping
- … `docs/`, docstrings), or N/A
- … `cli-config.yaml.example` if I added/changed config keys, or N/A
- … `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
Code Intelligence
- `plugins/observability/langfuse/__init__.py`: `TraceState`, `on_post_llm_call`, `_finish_trace`
- `post_api_request` vs `post_llm_call` hook parameter asymmetry: `turn_finalizer.py` fires `post_llm_call` without `response`/`usage` kwargs while `conversation_loop.py` fires `post_api_request` with full usage data
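As a rough picture of what the root-trace attachment and empty-fallback safety tests listed under Changes Made might check, here is a self-contained sketch. `FakeSpan` and `finish_trace` are hypothetical stand-ins written for this example; they are not the Langfuse SDK classes or the plugin's `_finish_trace`, and the keyword names mirror the `usage_details`/`cost_details` terms used in this PR.

```python
class FakeSpan:
    """Hypothetical recording stand-in for a root trace span."""

    def __init__(self) -> None:
        self.updates = []

    def update(self, **kwargs) -> None:
        # Record every update so a test can inspect what was attached.
        self.updates.append(kwargs)


def finish_trace(root_span, accumulated_usage: dict, accumulated_cost: dict) -> None:
    """Attach non-empty accumulated totals to the root span; otherwise do nothing."""
    payload = {}
    if accumulated_usage:
        payload["usage_details"] = dict(accumulated_usage)
    if accumulated_cost:
        payload["cost_details"] = dict(accumulated_cost)
    if payload:
        root_span.update(**payload)


def test_root_trace_receives_totals() -> None:
    span = FakeSpan()
    finish_trace(span, {"input": 150, "output": 30}, {"total": 0.75})
    assert span.updates == [
        {
            "usage_details": {"input": 150, "output": 30},
            "cost_details": {"total": 0.75},
        }
    ]


def test_empty_totals_do_not_touch_root_trace() -> None:
    span = FakeSpan()
    finish_trace(span, {}, {})
    assert span.updates == []
```

Tests of this shape only exercise the attachment logic in isolation. As the review thread notes, they say nothing about whether real hook invocations ever deliver non-empty totals to it.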