Skip to content

fix(langfuse): accumulate usage across hook invocations for traces (#42306)#42327

Closed
liuhao1024 wants to merge 1 commit into
NousResearch:mainfrom
liuhao1024:fix/langfuse-usage-accumulation
Closed

fix(langfuse): accumulate usage across hook invocations for traces (#42306)#42327
liuhao1024 wants to merge 1 commit into
NousResearch:mainfrom
liuhao1024:fix/langfuse-usage-accumulation

Conversation

@liuhao1024

Copy link
Copy Markdown
Contributor

What does this PR do?

Accumulates usage/cost data in the langfuse plugin's TraceState so that when post_llm_call fires without response or usage kwargs (as happens via turn_finalizer.py), the handler can fall back to values already collected from earlier post_api_request invocations. Also attaches accumulated usage totals to the root trace in _finish_trace.

Related Issue

Fixes #42306

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • plugins/observability/langfuse/__init__.py: Added accumulated_usage_details and accumulated_cost_details fields to TraceState. Modified on_post_llm_call to accumulate usage/cost from each invocation and use accumulated values as fallback when post_llm_call fires without response/usage. Modified _finish_trace to attach accumulated usage/cost to the root trace span.
  • tests/plugins/test_langfuse_usage_accumulation.py: Added 6 tests covering accumulation across API calls, fallback to accumulated values, empty fallback safety, and root-trace usage attachment.

How to Test

  1. Enable the langfuse plugin: hermes plugins enable observability/langfuse
  2. Set valid Langfuse credentials in ~/.hermes/.env (HERMES_LANGFUSE_PUBLIC_KEY, HERMES_LANGFUSE_SECRET_KEY)
  3. Run a conversation: hermes -p <profile> -m "test"
  4. Check the Langfuse dashboard — GENERATION spans should now show usage/token counts and cost
  5. Run pytest tests/plugins/test_langfuse_plugin.py tests/plugins/test_langfuse_usage_accumulation.py -v — all 43 tests should pass

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/plugins/test_langfuse_plugin.py tests/plugins/test_langfuse_usage_accumulation.py -v and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Code Intelligence

  • Analyzed: plugins/observability/langfuse/__init__.pyTraceState, on_post_llm_call, _finish_trace
  • Blast radius: LOW — changes are confined to the langfuse plugin's internal state management; no external APIs or other plugins affected
  • Related patterns: post_api_request vs post_llm_call hook parameter asymmetry — turn_finalizer.py fires post_llm_call without response/usage kwargs while conversation_loop.py fires post_api_request with full usage data

…ousResearch#42306)

When `post_api_request` fires per-API-call with usage data but
`post_llm_call` (from `turn_finalizer.py`) fires without `response`
or `usage` kwargs, the handler fell through to empty `usage_details`
and `cost_details`. This left Langfuse GENERATION spans without token
counts or cost information.

Fix: accumulate usage/cost in `TraceState` from each `post_api_request`
invocation. When a subsequent `post_llm_call` fires without usage data,
use the accumulated values as fallback. Also attach accumulated totals
to the root trace in `_finish_trace` so the Langfuse dashboard shows
usage at the trace level.
@alt-glitch alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins telemetry Touches outbound telemetry, usage attribution, or analytics — needs opt-in gating before merge P3 Low — cosmetic, nice to have labels Jun 8, 2026
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Thanks for digging into this one. I tried to verify the fix end-to-end against the actual hook plumbing and I don't think this change fixes #42306 — I believe the root cause is upstream of where this patch acts. Sharing my findings in case they're useful:

The premise is half-right. turn_finalizer.py does fire post_llm_call without response/usage, and it routes to the same on_post_llm_call. But that invocation early-returns at the generation = state.generations.pop(req_key, None) / if generation is None: return guard, because the generation for api_call_count=0 was already popped by the matching post_api_request. So the new fallback branch (the else: usage_details = dict(state.accumulated_usage_details) path) and the post_llm_call→accumulated mechanism never actually execute at runtime. I simulated the real hook ordering (single-call and multi-call turns) and the fallback produces zero attachments in every case.

The real root cause is the response gate, which this PR doesn't touch. conversation_loop.py passes response= as a sanitized dict (_api_response_payload_for_hook_sanitize_hook_payload({...})). In on_post_llm_call, if response is not None: swallows that dict and calls _usage_and_cost(response, ...), which does raw_usage = getattr(response, "usage", None) — a dict has no .usage attribute, so it returns None, returns {}, {}, and the elif isinstance(usage, dict) fallback is never reached. Usage is silently dropped.

I reproduced #42306 exactly on this branch — feeding the precise payload conversation_loop sends (sanitized dict response + separate usage dict), the GENERATION span ends with usage_details = {}:

on_pre_llm_request(api_call_count=0, ...)
on_post_llm_call(
    api_call_count=0,
    response={"model": "...", "usage": {"input_tokens": 15276, "output_tokens": 45}},  # sanitized DICT
    usage={"input_tokens": 15276, "output_tokens": 45},
    ...)
# GENERATION span usage_details -> {}   (still empty on this branch)

Because the extraction itself returns empty for the affected (sanitized-dict) case, the accumulated total attached to the root trace also accumulates zeros for those users.

Tests pass but validate an unreachable state. test_fallback_uses_accumulated_usage manually does state.generations["0"] = gen and pre-seeds accumulated_usage_details. In real runtime post_api_request pops generations["0"] before post_llm_call fires, so that branch can't be hit.

#40560 looks like the correct fix — it gates on getattr(response, "usage", None) is not None so sanitized dicts fall through to the usage-dict branch. I confirmed that one-line change resolves the exact scenario this PR leaves broken. Might be worth coordinating with that PR rather than landing this one.

Happy to be wrong here — if you're seeing usage attach in a setup where post_api_request doesn't fire (so only post_llm_call carries data), let me know the config and I'll re-check.

@liuhao1024

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough analysis @kshitijk4poor — you're right, and I appreciate you tracing the actual hook ordering.

You've identified the real issue: conversation_loop.py passes response as a sanitized dict via _api_response_payload_for_hook, so getattr(response, "usage", None) in on_post_llm_call returns None (dicts don't have a .usage attribute), and usage is silently dropped. The fallback branch I added never executes because post_api_request already pops the generation before post_llm_call fires.

I've confirmed #40560 by @kamonspecial addresses the correct root cause with a one-line guard on the sanitized-dict path. Closing this PR in favor of #40560.

@liuhao1024 liuhao1024 closed this Jun 8, 2026
teknium1 added a commit that referenced this pull request Jun 9, 2026
AGENTS.md was almost entirely how-to/mechanics with the want/don't-want
guidance implicit and scattered. Adds a single authoritative intent layer
near the top, calibrated against what actually merges and what actually
gets rejected.

- 'What Hermes Is': framing + the two properties that drive design
  (prompt-cache integrity, narrow-waist core).
- 'Contribution Rubric': dual-purpose intent doc — (1) for humans/own work:
  what gets merged vs rejected; (2) for the triage sweeper: when a PR is safe
  to close on the three allowed reasons AND when NOT to close one. Taste-based
  'won't implement / out of scope' closes stay human-only by design.
  - 'What we want' calibrated against the last ~55 merges: fix real bugs well,
    expand reach at the edges (platforms/channels/providers/models/desktop —
    large features land routinely), refactor god-files into clean modules,
    keep the CORE narrow. 'Expansive at the edges, conservative at the waist.'
  - 'What we don't want': speculative hooks, .env-for-non-secrets, needless
    core tools, lazy-read escape hatches, feature-destroying fixes, ungated
    telemetry, change-detector tests, core-touching plugins.
  - 'Before you call it a bug — verify the premise (and when NOT to close)':
    distilled from real closes (#41741 intentional-design-not-a-gap, #41610
    wrong-premise, #42327 fix-never-executes, #42393 deliberate-omission,
    #41999 overreach). Doubles as sweeper guidance to avoid wrongly closing
    legitimate PRs.
- 'The Footprint Ladder' (core-tool decision): extend > CLI+skill > gated tool
  > plugin > MCP server in the catalog > new core tool (last resort).

Trim: 'Adding New Tools' intro points at the ladder. Detailed mechanics stay
where readers need them.
teknium1 added a commit that referenced this pull request Jun 10, 2026
#42641)

AGENTS.md was almost entirely how-to/mechanics with the want/don't-want
guidance implicit and scattered. Adds a single authoritative intent layer
near the top, calibrated against what actually merges and what actually
gets rejected.

- 'What Hermes Is': framing + the two properties that drive design
  (prompt-cache integrity, narrow-waist core).
- 'Contribution Rubric': dual-purpose intent doc — (1) for humans/own work:
  what gets merged vs rejected; (2) for the triage sweeper: when a PR is safe
  to close on the three allowed reasons AND when NOT to close one. Taste-based
  'won't implement / out of scope' closes stay human-only by design.
  - 'What we want' calibrated against the last ~55 merges: fix real bugs well,
    expand reach at the edges (platforms/channels/providers/models/desktop —
    large features land routinely), refactor god-files into clean modules,
    keep the CORE narrow. 'Expansive at the edges, conservative at the waist.'
  - 'What we don't want': speculative hooks, .env-for-non-secrets, needless
    core tools, lazy-read escape hatches, feature-destroying fixes, ungated
    telemetry, change-detector tests, core-touching plugins.
  - 'Before you call it a bug — verify the premise (and when NOT to close)':
    distilled from real closes (#41741 intentional-design-not-a-gap, #41610
    wrong-premise, #42327 fix-never-executes, #42393 deliberate-omission,
    #41999 overreach). Doubles as sweeper guidance to avoid wrongly closing
    legitimate PRs.
- 'The Footprint Ladder' (core-tool decision): extend > CLI+skill > gated tool
  > plugin > MCP server in the catalog > new core tool (last resort).

Trim: 'Adding New Tools' intro points at the ladder. Detailed mechanics stay
where readers need them.
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
NousResearch#42641)

AGENTS.md was almost entirely how-to/mechanics with the want/don't-want
guidance implicit and scattered. Adds a single authoritative intent layer
near the top, calibrated against what actually merges and what actually
gets rejected.

- 'What Hermes Is': framing + the two properties that drive design
  (prompt-cache integrity, narrow-waist core).
- 'Contribution Rubric': dual-purpose intent doc — (1) for humans/own work:
  what gets merged vs rejected; (2) for the triage sweeper: when a PR is safe
  to close on the three allowed reasons AND when NOT to close one. Taste-based
  'won't implement / out of scope' closes stay human-only by design.
  - 'What we want' calibrated against the last ~55 merges: fix real bugs well,
    expand reach at the edges (platforms/channels/providers/models/desktop —
    large features land routinely), refactor god-files into clean modules,
    keep the CORE narrow. 'Expansive at the edges, conservative at the waist.'
  - 'What we don't want': speculative hooks, .env-for-non-secrets, needless
    core tools, lazy-read escape hatches, feature-destroying fixes, ungated
    telemetry, change-detector tests, core-touching plugins.
  - 'Before you call it a bug — verify the premise (and when NOT to close)':
    distilled from real closes (NousResearch#41741 intentional-design-not-a-gap, NousResearch#41610
    wrong-premise, NousResearch#42327 fix-never-executes, NousResearch#42393 deliberate-omission,
    NousResearch#41999 overreach). Doubles as sweeper guidance to avoid wrongly closing
    legitimate PRs.
- 'The Footprint Ladder' (core-tool decision): extend > CLI+skill > gated tool
  > plugin > MCP server in the catalog > new core tool (last resort).

Trim: 'Adding New Tools' intro points at the ladder. Detailed mechanics stay
where readers need them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have telemetry Touches outbound telemetry, usage attribution, or analytics — needs opt-in gating before merge type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: observability/langfuse: GENERATION spans created but missing usage/token counts/costs (no post-LLM hook activity logged)

3 participants