fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763)#26320
Merged
kshitijk4poor merged 4 commits intoMay 15, 2026
Conversation
When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes NousResearch#22763 Fixes NousResearch#23823
on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input.
Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message)
…ow-copy request_messages, real threaded FIFO test Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.
This was referenced May 15, 2026
Closed
Contributor
|
Thanks @kshitijk4poor ! |
Wizarck
added a commit
to Wizarck/hermes-agent
that referenced
this pull request
May 15, 2026
Matches the upstream/main HEAD merged in the previous commit. Notable delta vs the prior pin: PR NousResearch#26320 "complete observability fix" lands in plugins/observability/langfuse/__init__.py — auto-merged with our overlay's application/consumer tag patch, no conflicts. Bonus: upstream added a HERMES_LANGFUSE_PUBLIC_KEY/SECRET_KEY placeholder-prefix validator that happens to validate exactly the env vars wired in PR #8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
1 task
DIZ-admin
pushed a commit
to DIZ-admin/hermes-agent
that referenced
this pull request
May 16, 2026
…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320) * fix(langfuse): reject placeholder credentials with one-shot warning When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes NousResearch#22763 Fixes NousResearch#23823 * fix(langfuse): use actual API request messages for generation input on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input. * fix(langfuse): record tool call outputs in traces Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message) * fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass. --------- Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com>
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320) * fix(langfuse): reject placeholder credentials with one-shot warning When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes NousResearch#22763 Fixes NousResearch#23823 * fix(langfuse): use actual API request messages for generation input on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input. * fix(langfuse): record tool call outputs in traces Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message) * fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass. --------- Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Combines and supersedes two open community PRs (#22345 by @btorresgil and #23831 by @xxxigm) to deliver a complete Langfuse observability fix:
inputwas empty becauseon_pre_llm_requestonly read themessageskwarg, but the runtime delivers the payload asrequest_messages/conversation_history/user_messagedepending on call path. Fixed by_coerce_request_messageshelper.Output: undefinedbecausepre_tool_call/post_tool_callcouldn't match observations whentool_call_idwas empty (the common case for many built-in tools). Fixed by per-tool-name FIFO queue instate.pending_tools_by_name.placeholder/test-key/your-langfuse-keywere silently accepted by the SDK and dropped at flush time. Fixed by prefix-whitelist validator using documentedpk-lf-/sk-lf-Langfuse prefixes with a one-shot warning and_INIT_FAILEDcache.Plus a self-review polish commit that removes a dead/buggy
is_first_turnkwarg, replaces an expensivecopy.deepcopywith shallowlist()copy, adds a real threaded FIFO test, and fixes a stale comment.Related Issues
Supersedes / Replaces
fix(langfuse): use actual API request messages for generation input+fix(langfuse): record tool call outputs in traces) with author preserved. Dropped the original PR body's claim that this depends on fix(hooks): fire post_tool_call hook for built-in tools (closes #12922) #12928 — that change was already centralized on main (model_tools.pyalready firespost_tool_callfor all tools viahandle_function_call).your-langfuse-key,<your-key>,REPLACE_ME,change-me,xxx. Fix/langfuse placeholder key warning #23831's prefix-whitelist approach is unambiguously better. Will be closed with credit pointing here..envsanitizer suffix-trap (GLM_API_KEYvsLM_API_KEY,HERMES_LANGFUSE_*vsLANGFUSE_*). The same bug class is already fixed onmainvia the overlap-range approach in_sanitize_env_lines(seehermes_cli/config.py:4400-4457, with an explicit code comment naming theLM_API_KEY/GLM_API_KEYcase). Will be closed with credit pointing to that landed fix.post_tool_callhook firing for built-in tools is already onmain(model_tools.py:782-795). Will be closed.Attribution
fix(langfuse): use actual API request messages for generation inputfix(langfuse): record tool call outputs in tracesfix(langfuse): reject placeholder credentials with one-shot warningfix(langfuse): salvage-review polishHard dependency
🛑 Blocks on #26319 — that PR adds
brian@dralth.com → btorresgiltoAUTHOR_MAP. The salvage commits authored by @btorresgil use that email; without the mapping, release attribution audit will fail. #26319 must merge first.Self-review polish (4th commit)
Adversarial self-review surfaced three real issues; rather than deferring to follow-ups they're fixed in the same PR:
is_first_turnkwarg. The original PR passedis_first_turn=(not bool(conversation_history))topre_api_request, butconversation_historyis reassigned toNonemid-run after compression (5 sites inrun_agent.py). The boolean flipped False → True mid-conversation on every post-compression API call. The plugin never consumed it, so the kwarg was both misleading and dead — removed entirely.copy.deepcopy(request_messages)with shallowlist(request_messages). Thepre_api_requesthook contract discards return values (no mutation can leak back toapi_kwargs), and the plugin's_serialize_messagesalready builds its own snapshot dicts. The original deepcopy walked every tool result and base64 image on every API call for no isolation benefit; shallow copy of the outer list is sufficient.test_empty_tool_call_id_concurrent_fifo_order→..._observations_are_fifo_within_tool_nameand added a realtest_threaded_post_calls_preserve_fifo_under_lockwith 8 threads + barrier that actually exercises_STATE_LOCKon the FIFO queue. The original test was sequential and only validated Python list semantics, not the lock discipline.# Cleared by reset_cache_for_tests()comment on_INIT_FAILED— that function does not exist; tests reload the module instead.Test plan
Local results from the salvage worktree:
tests/plugins/test_langfuse_plugin.py: 37 passed (was 7 before this PR — adds 30 regression tests acrossTestPlaceholderKeyDetection,TestRequestMessageCoercion,TestToolCallOutputBackfill,TestToolObservationKeying)tests/plugins/overall: 658 passed (was 657; +1 for the new threaded FIFO test)tests/run_agent/test_run_agent.py -k hook: 1 passed (thepre_api_request/post_api_requestcontract test; updated to assert the newrequest_messages/response/assistant_messagefields)Type of Change