fix(langfuse): fixes langfuse observability - llm input/output and tool output missing by btorresgil · Pull Request #22345 · NousResearch/hermes-agent

btorresgil · 2026-05-09T06:47:49Z

What does this PR do?

Note: Depends on #12928 to be a complete fix.

This fix works either way, just without #12928 the post_tool_call won't fire for
built-in tools (todo, memory, clarify, delegate_task, memory-provider tools), so those tool
calls still won't appear as observations in Langfuse even after this PR.

This is still a fix for all other tools besides these few built-in tools which are fixed by #12928.

I just didn't want to duplicate that PR here since it fixes a specific but related issue.

This PR fixes two independent observability gaps in the Langfuse plugin:

LLM generation inputs were empty and outputs missing the message on_pre_llm_request read only the messages kwarg, but depending on Hermes version and
call path the actual request payload arrives as request_messages, conversation_history, or user_message. Generations showed
an empty input instead of the real outbound message list.
Tool observations had no output. The Tool: child span in Langfuse showed input (arguments) but Output: undefined.
Root cause: when Hermes fires pre_tool_call and post_tool_call without a tool_call_id (the common case for most built-in and
registry tools), the plugin stored the observation under a unique time.time_ns()-based key in on_pre_tool_call that
on_post_tool_call could never reconstruct. The post hook fell back to popitem() which is unreliable under concurrent tool
calls. Every tool span ended up closed by the _finish_trace sweep with no output set.

Related Issue

Fixes #22342

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

plugins/observability/langfuse/init.py

FIFO queue for empty tool_call_id. TraceState gains a pending_tools_by_name: Dict[str, list] field. on_pre_tool_call
routes observations into the id-keyed tools dict when a tool_call_id is present, and into a per-name FIFO queue otherwise.
on_post_tool_call mirrors this: pop by id first, then drain the front of the per-name queue. This guarantees pre/post always
match the same observation in the order calls were dispatched. _finish_trace drains any leftover entries in
pending_tools_by_name so no spans are left hanging.
_coerce_request_messages helper. Picks the first available list from request_messages → messages → conversation_history,
falling back to a synthetic [{role: user, content: user_message}]. Used in on_pre_llm_request so the generation's input
reflects the real outbound payload regardless of which kwarg the runtime used.
Tool call serialization. _serialize_tool_calls now emits the OpenAI-style nested function: {name, arguments} shape
alongside the flat fields (Langfuse's UI uses the nested form for rendering). arguments is kept as a string rather than
parsed to a dict, matching what Langfuse expects.
Tool message serialization. _serialize_messages now preserves name alongside tool_call_id on role: tool messages, which
Langfuse needs to pair results with their tool calls.
Result backfill. In on_post_tool_call, when a tool_call_id is present the matching entry in state.turn_tool_calls is
updated with the result so the generation-level tool-call record carries the output, not just the arguments.

run_agent.py / model_tools.py (depends on #12928)

Removes _fire_post_tool_call_hook and all its call sites from the built-in tool branches (todo, session_search, memory,
clarify, delegate_task, memory-provider tools). Removes the skip_post_tool_call_hook parameter from handle_function_call.
Hook emission for these tools is now handled by the centralised mechanism in fix(hooks): fire post_tool_call hook for built-in tools (closes #12922) #12928.

How to Test

tests/plugins/test_langfuse_plugin.py — new TestToolObservationKeying class covers: single empty-tool_call_id round-trip
sets output; concurrent same-name tools are consumed FIFO without swapping outputs; non-empty tool_call_id path is
unaffected. New TestToolCallOutputBackfill covers backfill, tool-message serialization, and OpenAI tool-call shape. New
TestRequestMessageCoercion covers all four kwarg variants.
tests/run_agent/test_run_agent.py — updated to reflect removal of skip_post_tool_call_hook.

Steps:

Configure langfuse
In a new session, ask hermes to perform a tool call: "Use a tool to fetch example.com"
Check langfuse and observe that the LLM input and output and the tool input and output are now present.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform:

Documentation & Housekeeping

[N/A] I've updated relevant documentation (README, docs/, docstrings) — or N/A
[N/A] I've updated cli-config.yaml.example if I added/changed config keys — or N/A
[N/A] I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
[N/A] I've updated tool descriptions/schemas if I changed tool behavior — or N/A

on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input.

Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message)

zh820952173 · 2026-05-12T14:23:34Z

Quickly!!!!!!

@btorresgil

…#26319) PR #22345 by @btorresgil authors commits as 'Brian Conklin <brian@dralth.com>' (git config carries a different name/email than the GitHub account). GitHub's commit-author mapping correctly attributes these commits to @btorresgil based on the public-key registration, but Hermes' release attribution audit reads the raw commit email, not the GitHub mapping. Without this AUTHOR_MAP entry, salvaging #22345 would fail `scripts/contributor_audit.py` strict mode at release time. Prerequisite for the langfuse trace fix salvage that cherry-picks @btorresgil's commits onto current main.

…placeholder credentials (closes #22342, #22763) (#26320) * fix(langfuse): reject placeholder credentials with one-shot warning When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes #22763 Fixes #23823 * fix(langfuse): use actual API request messages for generation input on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input. * fix(langfuse): record tool call outputs in traces Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message) * fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test Self-review of the combined #22345 + #23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass. --------- Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com>

kshitijk4poor · 2026-05-15T12:05:23Z

Thanks @btorresgil — superseded by #26320 which cherry-picks both of your commits with author preserved:

fix(langfuse): use actual API request messages for generation input
fix(langfuse): record tool call outputs in traces

The salvage combines these with @xxxigm's placeholder-credential fix (#23831) into one logical Langfuse fix landing on main. The dependency on #12928 mentioned in your PR body turned out to be already satisfied — post_tool_call is now centrally fired for all tools in model_tools.py:782-795.

Self-review polish on top (4th commit): removed a dead+buggy is_first_turn kwarg, replaced the copy.deepcopy(request_messages) with shallow list() (the hook contract doesn't allow mutation back to api_kwargs anyway), and added a real threaded FIFO test that actually exercises _STATE_LOCK instead of just Python list semantics.

Closing as superseded.

@btorresgil

…2345 salvage (NousResearch#26319) PR NousResearch#22345 by @btorresgil authors commits as 'Brian Conklin <brian@dralth.com>' (git config carries a different name/email than the GitHub account). GitHub's commit-author mapping correctly attributes these commits to @btorresgil based on the public-key registration, but Hermes' release attribution audit reads the raw commit email, not the GitHub mapping. Without this AUTHOR_MAP entry, salvaging NousResearch#22345 would fail `scripts/contributor_audit.py` strict mode at release time. Prerequisite for the langfuse trace fix salvage that cherry-picks @btorresgil's commits onto current main.

…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320) * fix(langfuse): reject placeholder credentials with one-shot warning When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes NousResearch#22763 Fixes NousResearch#23823 * fix(langfuse): use actual API request messages for generation input on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input. * fix(langfuse): record tool call outputs in traces Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message) * fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass. --------- Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com>

@btorresgil

…2345 salvage (NousResearch#26319) PR NousResearch#22345 by @btorresgil authors commits as 'Brian Conklin <brian@dralth.com>' (git config carries a different name/email than the GitHub account). GitHub's commit-author mapping correctly attributes these commits to @btorresgil based on the public-key registration, but Hermes' release attribution audit reads the raw commit email, not the GitHub mapping. Without this AUTHOR_MAP entry, salvaging NousResearch#22345 would fail `scripts/contributor_audit.py` strict mode at release time. Prerequisite for the langfuse trace fix salvage that cherry-picks @btorresgil's commits onto current main.

…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320) * fix(langfuse): reject placeholder credentials with one-shot warning When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes NousResearch#22763 Fixes NousResearch#23823 * fix(langfuse): use actual API request messages for generation input on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input. * fix(langfuse): record tool call outputs in traces Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message) * fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass. --------- Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com>

btorresgil added 2 commits May 8, 2026 23:04

btorresgil mentioned this pull request May 9, 2026

[Bug]: Langfuse traces missing LLM input/output and tool output #22342

Closed

1 task

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/agent Core agent loop, run_agent.py, prompt builder labels May 9, 2026

btorresgil changed the title ~~Fix langfuse observability - llm input/output and~~ Fix langfuse observability - llm input/output and tool output missing May 9, 2026

btorresgil changed the title ~~Fix langfuse observability - llm input/output and tool output missing~~ fix(langfuse): fixes langfuse observability - llm input/output and tool output missing May 9, 2026

This was referenced May 15, 2026

chore(release): map brian@dralth.com to btorresgil #26319

Merged

fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763) #26320

Merged

kshitijk4poor closed this May 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(langfuse): fixes langfuse observability - llm input/output and tool output missing#22345

fix(langfuse): fixes langfuse observability - llm input/output and tool output missing#22345
btorresgil wants to merge 2 commits into
NousResearch:mainfrom
btorresgil:fix/langfuse

btorresgil commented May 9, 2026 •

edited

Loading

Uh oh!

zh820952173 commented May 12, 2026

Uh oh!

kshitijk4poor commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

btorresgil commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Uh oh!

zh820952173 commented May 12, 2026

Uh oh!

kshitijk4poor commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

btorresgil commented May 9, 2026 •

edited

Loading