Skip to content

fix(langfuse): fixes langfuse observability - llm input/output and tool output missing#22345

Closed
btorresgil wants to merge 2 commits into
NousResearch:mainfrom
btorresgil:fix/langfuse
Closed

fix(langfuse): fixes langfuse observability - llm input/output and tool output missing#22345
btorresgil wants to merge 2 commits into
NousResearch:mainfrom
btorresgil:fix/langfuse

Conversation

@btorresgil

@btorresgil btorresgil commented May 9, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?


Note: Depends on #12928 to be a complete fix.

This fix works either way, just without #12928 the post_tool_call won't fire for
built-in tools (todo, memory, clarify, delegate_task, memory-provider tools), so those tool
calls still won't appear as observations in Langfuse even after this PR.

This is still a fix for all other tools besides these few built-in tools which are fixed by #12928.

I just didn't want to duplicate that PR here since it fixes a specific but related issue.


This PR fixes two independent observability gaps in the Langfuse plugin:

  1. LLM generation inputs were empty and outputs missing the message on_pre_llm_request read only the messages kwarg, but depending on Hermes version and
    call path the actual request payload arrives as request_messages, conversation_history, or user_message. Generations showed
    an empty input instead of the real outbound message list.

  2. Tool observations had no output. The Tool: child span in Langfuse showed input (arguments) but Output: undefined.
    Root cause: when Hermes fires pre_tool_call and post_tool_call without a tool_call_id (the common case for most built-in and
    registry tools), the plugin stored the observation under a unique time.time_ns()-based key in on_pre_tool_call that
    on_post_tool_call could never reconstruct. The post hook fell back to popitem() which is unreliable under concurrent tool
    calls. Every tool span ended up closed by the _finish_trace sweep with no output set.

Related Issue

Fixes #22342

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

plugins/observability/langfuse/init.py

  • FIFO queue for empty tool_call_id. TraceState gains a pending_tools_by_name: Dict[str, list] field. on_pre_tool_call
    routes observations into the id-keyed tools dict when a tool_call_id is present, and into a per-name FIFO queue otherwise.
    on_post_tool_call mirrors this: pop by id first, then drain the front of the per-name queue. This guarantees pre/post always
    match the same observation in the order calls were dispatched. _finish_trace drains any leftover entries in
    pending_tools_by_name so no spans are left hanging.
  • _coerce_request_messages helper. Picks the first available list from request_messages → messages → conversation_history,
    falling back to a synthetic [{role: user, content: user_message}]. Used in on_pre_llm_request so the generation's input
    reflects the real outbound payload regardless of which kwarg the runtime used.
  • Tool call serialization. _serialize_tool_calls now emits the OpenAI-style nested function: {name, arguments} shape
    alongside the flat fields (Langfuse's UI uses the nested form for rendering). arguments is kept as a string rather than
    parsed to a dict, matching what Langfuse expects.
  • Tool message serialization. _serialize_messages now preserves name alongside tool_call_id on role: tool messages, which
    Langfuse needs to pair results with their tool calls.
  • Result backfill. In on_post_tool_call, when a tool_call_id is present the matching entry in state.turn_tool_calls is
    updated with the result so the generation-level tool-call record carries the output, not just the arguments.

run_agent.py / model_tools.py (depends on #12928)

  • Removes _fire_post_tool_call_hook and all its call sites from the built-in tool branches (todo, session_search, memory,
    clarify, delegate_task, memory-provider tools). Removes the skip_post_tool_call_hook parameter from handle_function_call.
    Hook emission for these tools is now handled by the centralised mechanism in fix(hooks): fire post_tool_call hook for built-in tools (closes #12922) #12928.

How to Test

  • tests/plugins/test_langfuse_plugin.py — new TestToolObservationKeying class covers: single empty-tool_call_id round-trip
    sets output; concurrent same-name tools are consumed FIFO without swapping outputs; non-empty tool_call_id path is
    unaffected. New TestToolCallOutputBackfill covers backfill, tool-message serialization, and OpenAI tool-call shape. New
    TestRequestMessageCoercion covers all four kwarg variants.
  • tests/run_agent/test_run_agent.py — updated to reflect removal of skip_post_tool_call_hook.

Steps:

  1. Configure langfuse
  2. In a new session, ask hermes to perform a tool call: "Use a tool to fetch example.com"
  3. Check langfuse and observe that the LLM input and output and the tool input and output are now present.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform:

Documentation & Housekeeping

  • [N/A] I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • [N/A] I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • [N/A] I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • [N/A] I've updated tool descriptions/schemas if I changed tool behavior — or N/A

btorresgil added 2 commits May 8, 2026 23:04
on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.
Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/agent Core agent loop, run_agent.py, prompt builder labels May 9, 2026
@btorresgil btorresgil changed the title Fix langfuse observability - llm input/output and Fix langfuse observability - llm input/output and tool output missing May 9, 2026
@btorresgil btorresgil changed the title Fix langfuse observability - llm input/output and tool output missing fix(langfuse): fixes langfuse observability - llm input/output and tool output missing May 9, 2026
@zh820952173

Copy link
Copy Markdown

Quickly!!!!!!

kshitijk4poor added a commit that referenced this pull request May 15, 2026
…#26319)

PR #22345 by @btorresgil authors commits as 'Brian Conklin
<brian@dralth.com>' (git config carries a different name/email than the
GitHub account). GitHub's commit-author mapping correctly attributes these
commits to @btorresgil based on the public-key registration, but Hermes'
release attribution audit reads the raw commit email, not the GitHub
mapping. Without this AUTHOR_MAP entry, salvaging #22345 would fail
`scripts/contributor_audit.py` strict mode at release time.

Prerequisite for the langfuse trace fix salvage that cherry-picks
@btorresgil's commits onto current main.
kshitijk4poor added a commit that referenced this pull request May 15, 2026
…placeholder credentials (closes #22342, #22763) (#26320)

* fix(langfuse): reject placeholder credentials with one-shot warning

When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes #22763
Fixes #23823

* fix(langfuse): use actual API request messages for generation input

on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.

* fix(langfuse): record tool call outputs in traces

Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)

* fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test

Self-review of the combined #22345 + #23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.

---------

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Thanks @btorresgil — superseded by #26320 which cherry-picks both of your commits with author preserved:

  • fix(langfuse): use actual API request messages for generation input
  • fix(langfuse): record tool call outputs in traces

The salvage combines these with @xxxigm's placeholder-credential fix (#23831) into one logical Langfuse fix landing on main. The dependency on #12928 mentioned in your PR body turned out to be already satisfied — post_tool_call is now centrally fired for all tools in model_tools.py:782-795.

Self-review polish on top (4th commit): removed a dead+buggy is_first_turn kwarg, replaced the copy.deepcopy(request_messages) with shallow list() (the hook contract doesn't allow mutation back to api_kwargs anyway), and added a real threaded FIFO test that actually exercises _STATE_LOCK instead of just Python list semantics.

Closing as superseded.

DIZ-admin pushed a commit to DIZ-admin/hermes-agent that referenced this pull request May 16, 2026
…2345 salvage (NousResearch#26319)

PR NousResearch#22345 by @btorresgil authors commits as 'Brian Conklin
<brian@dralth.com>' (git config carries a different name/email than the
GitHub account). GitHub's commit-author mapping correctly attributes these
commits to @btorresgil based on the public-key registration, but Hermes'
release attribution audit reads the raw commit email, not the GitHub
mapping. Without this AUTHOR_MAP entry, salvaging NousResearch#22345 would fail
`scripts/contributor_audit.py` strict mode at release time.

Prerequisite for the langfuse trace fix salvage that cherry-picks
@btorresgil's commits onto current main.
DIZ-admin pushed a commit to DIZ-admin/hermes-agent that referenced this pull request May 16, 2026
…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320)

* fix(langfuse): reject placeholder credentials with one-shot warning

When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes NousResearch#22763
Fixes NousResearch#23823

* fix(langfuse): use actual API request messages for generation input

on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.

* fix(langfuse): record tool call outputs in traces

Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)

* fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test

Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.

---------

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…2345 salvage (NousResearch#26319)

PR NousResearch#22345 by @btorresgil authors commits as 'Brian Conklin
<brian@dralth.com>' (git config carries a different name/email than the
GitHub account). GitHub's commit-author mapping correctly attributes these
commits to @btorresgil based on the public-key registration, but Hermes'
release attribution audit reads the raw commit email, not the GitHub
mapping. Without this AUTHOR_MAP entry, salvaging NousResearch#22345 would fail
`scripts/contributor_audit.py` strict mode at release time.

Prerequisite for the langfuse trace fix salvage that cherry-picks
@btorresgil's commits onto current main.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320)

* fix(langfuse): reject placeholder credentials with one-shot warning

When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes NousResearch#22763
Fixes NousResearch#23823

* fix(langfuse): use actual API request messages for generation input

on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.

* fix(langfuse): record tool call outputs in traces

Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)

* fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test

Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.

---------

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Langfuse traces missing LLM input/output and tool output

4 participants