Skip to content

fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763)#26320

Merged
kshitijk4poor merged 4 commits into
NousResearch:mainfrom
kshitijk4poor:salvage/langfuse-trace-and-placeholder-fix
May 15, 2026
Merged

fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763)#26320
kshitijk4poor merged 4 commits into
NousResearch:mainfrom
kshitijk4poor:salvage/langfuse-trace-and-placeholder-fix

Conversation

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

What does this PR do?

Combines and supersedes two open community PRs (#22345 by @btorresgil and #23831 by @xxxigm) to deliver a complete Langfuse observability fix:

  • Bug A — Langfuse generation input was empty because on_pre_llm_request only read the messages kwarg, but the runtime delivers the payload as request_messages / conversation_history / user_message depending on call path. Fixed by _coerce_request_messages helper.
  • Bug B — Langfuse tool spans had Output: undefined because pre_tool_call / post_tool_call couldn't match observations when tool_call_id was empty (the common case for many built-in tools). Fixed by per-tool-name FIFO queue in state.pending_tools_by_name.
  • Bug C — Placeholder credentials like placeholder / test-key / your-langfuse-key were silently accepted by the SDK and dropped at flush time. Fixed by prefix-whitelist validator using documented pk-lf- / sk-lf- Langfuse prefixes with a one-shot warning and _INIT_FAILED cache.

Plus a self-review polish commit that removes a dead/buggy is_first_turn kwarg, replaces an expensive copy.deepcopy with shallow list() copy, adds a real threaded FIFO test, and fixes a stale comment.

Related Issues

Supersedes / Replaces

Attribution

Commits Author Source
fix(langfuse): use actual API request messages for generation input @btorresgil #22345
fix(langfuse): record tool call outputs in traces @btorresgil #22345
fix(langfuse): reject placeholder credentials with one-shot warning @xxxigm #23831 (4 commits squashed)
fix(langfuse): salvage-review polish @kshitijk4poor This PR (self-review fixes)

Hard dependency

🛑 Blocks on #26319 — that PR adds brian@dralth.com → btorresgil to AUTHOR_MAP. The salvage commits authored by @btorresgil use that email; without the mapping, release attribution audit will fail. #26319 must merge first.

Self-review polish (4th commit)

Adversarial self-review surfaced three real issues; rather than deferring to follow-ups they're fixed in the same PR:

  1. Removed dead+buggy is_first_turn kwarg. The original PR passed is_first_turn=(not bool(conversation_history)) to pre_api_request, but conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py). The boolean flipped False → True mid-conversation on every post-compression API call. The plugin never consumed it, so the kwarg was both misleading and dead — removed entirely.
  2. Replaced copy.deepcopy(request_messages) with shallow list(request_messages). The pre_api_request hook contract discards return values (no mutation can leak back to api_kwargs), and the plugin's _serialize_messages already builds its own snapshot dicts. The original deepcopy walked every tool result and base64 image on every API call for no isolation benefit; shallow copy of the outer list is sufficient.
  3. Renamed test_empty_tool_call_id_concurrent_fifo_order..._observations_are_fifo_within_tool_name and added a real test_threaded_post_calls_preserve_fifo_under_lock with 8 threads + barrier that actually exercises _STATE_LOCK on the FIFO queue. The original test was sequential and only validated Python list semantics, not the lock discipline.
  4. Fixed stale # Cleared by reset_cache_for_tests() comment on _INIT_FAILED — that function does not exist; tests reload the module instead.

Test plan

bash scripts/run_tests.sh tests/plugins/test_langfuse_plugin.py tests/run_agent/test_run_agent.py

Local results from the salvage worktree:

  • tests/plugins/test_langfuse_plugin.py: 37 passed (was 7 before this PR — adds 30 regression tests across TestPlaceholderKeyDetection, TestRequestMessageCoercion, TestToolCallOutputBackfill, TestToolObservationKeying)
  • tests/plugins/ overall: 658 passed (was 657; +1 for the new threaded FIFO test)
  • tests/run_agent/test_run_agent.py -k hook: 1 passed (the pre_api_request / post_api_request contract test; updated to assert the new request_messages / response / assistant_message fields)

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Salvage (combining + extending community PRs)

xxxigm and others added 4 commits May 15, 2026 17:03
When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes NousResearch#22763
Fixes NousResearch#23823
on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.
Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)
…ow-copy request_messages, real threaded FIFO test

Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.
@btorresgil

Copy link
Copy Markdown
Contributor

Thanks @kshitijk4poor !

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 15, 2026
Wizarck added a commit to Wizarck/hermes-agent that referenced this pull request May 15, 2026
Matches the upstream/main HEAD merged in the previous commit. Notable
delta vs the prior pin: PR NousResearch#26320 "complete observability fix" lands in
plugins/observability/langfuse/__init__.py — auto-merged with our
overlay's application/consumer tag patch, no conflicts. Bonus: upstream
added a HERMES_LANGFUSE_PUBLIC_KEY/SECRET_KEY placeholder-prefix
validator that happens to validate exactly the env vars wired in PR #8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DIZ-admin pushed a commit to DIZ-admin/hermes-agent that referenced this pull request May 16, 2026
…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320)

* fix(langfuse): reject placeholder credentials with one-shot warning

When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes NousResearch#22763
Fixes NousResearch#23823

* fix(langfuse): use actual API request messages for generation input

on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.

* fix(langfuse): record tool call outputs in traces

Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)

* fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test

Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.

---------

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…placeholder credentials (closes NousResearch#22342, NousResearch#22763) (NousResearch#26320)

* fix(langfuse): reject placeholder credentials with one-shot warning

When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes NousResearch#22763
Fixes NousResearch#23823

* fix(langfuse): use actual API request messages for generation input

on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.

* fix(langfuse): record tool call outputs in traces

Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)

* fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test

Self-review of the combined NousResearch#22345 + NousResearch#23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.

---------

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

4 participants