Skip to content

fix(plugins): surface reasoning_content in Langfuse assistant message (#29482)#29486

Closed
briandevans wants to merge 3 commits into
NousResearch:mainfrom
briandevans:fix/langfuse-reasoning-content-fallback-29482
Closed

fix(plugins): surface reasoning_content in Langfuse assistant message (#29482)#29486
briandevans wants to merge 3 commits into
NousResearch:mainfrom
briandevans:fix/langfuse-reasoning-content-fallback-29482

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

What does this PR do?

The bundled observability/langfuse plugin's _serialize_assistant_message reads only the top-level reasoning attribute, so providers that emit chain-of-thought via the reasoning_content convention (LM Studio, Moonshot, Qwen3 thinking, DeepSeek) silently lose their reasoning in Langfuse observations even though Hermes captures it correctly. Every LLM call N shows reasoning: None for these providers.

Root cause: NormalizedResponse (agent/transports/types.py) stores the reasoning text in provider_data["reasoning_content"] and exposes it via the reasoning_content property (line 115). The serializer never reads that property — only the top-level reasoning field. Fall back to reasoning_content when reasoning is empty/missing so the property is read on the production code path the issue reporter hits.

Mirrors the existing reasoning resolution shape used downstream — top-level reasoning (Anthropic / Codex) wins when set, reasoning_content (OpenAI-compatible reasoning_content convention) fills in for the providers that use it.

Audited siblings: the other two output-construction sites in the same hook (post_llm_call string path at line 828, post_api_request summary path at line 833) do not receive an assistant message object and hardcode reasoning: None. No widening needed in this PR.

Related Issue

Fixes #29482

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • plugins/observability/langfuse/__init__.py_serialize_assistant_message falls back to getattr(message, "reasoning_content", None) when the top-level reasoning is empty. Comment explains the two reasoning conventions.
  • tests/plugins/test_langfuse_plugin.py — new TestSerializeAssistantMessageReasoning class with 5 cases: top-level precedence, fallback when reasoning is None, fallback when reasoning == "", end-to-end via real NormalizedResponse(provider_data={"reasoning_content": ...}), and both-missing returns None.

How to Test

  1. uv run --with pytest --with pytest-xdist --with pytest-asyncio --with pytest-timeout python3 -m pytest tests/plugins/test_langfuse_plugin.py::TestSerializeAssistantMessageReasoning -v
  2. Full file regression: uv run --with pytest --with pytest-xdist --with pytest-asyncio --with pytest-timeout python3 -m pytest tests/plugins/test_langfuse_plugin.py -v → 42 passed.
  3. Regression guard: reverting only the production change makes the 3 fallback tests fail with AssertionError: assert '' == 'qwen3 thinking' (and similar); restoring it makes them pass — confirming the tests actually exercise the new behavior.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(plugins):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run focused tests for the touched code and all pass (42/42 in the langfuse file)
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.x

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (internal serializer)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) — N/A (pure-Python serializer, no platform paths)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

Before (current main):

# Langfuse "LLM call N" observation, LM Studio + Qwen3 thinking
output:
  content: "final answer …"
  reasoning: None     # ← lost
  tool_calls: [...]

After this PR:

output:
  content: "final answer …"
  reasoning: "the model's chain of thought …"   # ← surfaced
  tool_calls: [...]

Copilot AI review requested due to automatic review settings May 20, 2026 21:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates Langfuse message serialization to surface assistant “reasoning” across providers that store it under reasoning_content, and adds regression tests to cover precedence/fallback behavior.

Changes:

  • Update _serialize_assistant_message to prefer message.reasoning, then fall back to message.reasoning_content.
  • Add tests validating precedence, fallback for None/empty string, and the NormalizedResponse.provider_data["reasoning_content"] code path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/plugins/test_langfuse_plugin.py Adds regression tests for reasoning precedence and reasoning_content fallback behavior.
plugins/observability/langfuse/init.py Implements fallback to reasoning_content when reasoning is not set, so Langfuse observations include “thinking”.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Anthropic / Codex behaviour unchanged, then fall back so Langfuse's
# observation actually shows the thinking instead of `reasoning: None`.
reasoning = getattr(message, "reasoning", None)
if not reasoning:
Comment on lines 483 to 487
return {
"content": _safe_value(getattr(message, "content", None)),
"reasoning": _safe_value(getattr(message, "reasoning", None)),
"reasoning": _safe_value(reasoning),
"tool_calls": _serialize_tool_calls(getattr(message, "tool_calls", None)),
}
@swanhtet1992

Copy link
Copy Markdown

One case this doesn't catch: OpenRouter's unified format puts reasoning under reasoning_details (array of {type, summary|thinking|content|text} objects, exposed via the reasoning_details property at agent/transports/types.py:120). Both extract_content_or_reasoning in agent/auxiliary_client.py:4404 and _extract_reasoning in run_agent.py already cover this third source, so the plugin would be aligned with them by adding it.

Can send a follow-up PR for it once this merges, or fold it in here if easier.

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 20, 2026
@briandevans

briandevans commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

@swanhtet1992 Folded in via 90de53b03 — _serialize_assistant_message now mirrors agent.agent_runtime_helpers.extract_reasoning precedence exactly:

reasoningreasoning_contentreasoning_details (per-detail key order summary > thinking > content > text).

Covered by four new tests:

  • test_falls_back_to_reasoning_details_openrouter_unified — full {type, summary|thinking|content|text} walk with deterministic join order
  • test_reasoning_details_skips_non_dict_and_empty — non-dict entries / whitespace-only summary / detail with no recognised key / duplicate-stripping
  • test_top_level_reasoning_wins_over_reasoning_details — precedence guard
  • (plus the existing five for reasoning / reasoning_content)

Thanks for the precise pointer to agent/transports/types.py:120 and the sibling extractors — kept the diff aligned with both so the Langfuse observation no longer diverges from what the main agent loop actually surfaces.

@briandevans

Copy link
Copy Markdown
Contributor Author

CI audit — both test job failures are pre-existing baselines on clean origin/main (c6a380eb6), neither is in this PR's touched code:

Test Symptom Root cause on main
tests/plugins/web/test_web_search_provider_plugins.py::TestBundledPluginsRegister::test_all_seven_plugins_present_in_registry Left contains one more item: 'xai' Commit a0c031299 feat(web): add xAI Web Search provider plugin registered an 8th bundled web provider but the literal 7-plugin assertion (plus three parametrized 7-plugin lists in the same file) was never updated.
tests/hermes_cli/test_update_hangup_protection.py::TestInstallHangupProtection::test_wraps_stdout_and_stderr_with_mirror assert isinstance(sys.stdout, _UpdateOutputStream) is False under xdist Other tests in the same worker reload hermes_cli.main, leaving the module-imported class stale even though the live stream is the correct wrapper — isinstance fails across the reload boundary. Reproduces on clean origin/main under pytest -n auto; passes when the test runs alone.

Both are already addressed by @camaragon's #29224 (canonical fix covering both files). No action needed on this PR.

@briandevans briandevans force-pushed the fix/langfuse-reasoning-content-fallback-29482 branch from 90de53b to f80d83f Compare May 23, 2026 10:15
@briandevans briandevans force-pushed the fix/langfuse-reasoning-content-fallback-29482 branch from f80d83f to 552cb40 Compare May 24, 2026 14:15
@briandevans

Copy link
Copy Markdown
Contributor Author

@copilot Both review findings addressed in d953a4b18:

  • L482 (truthy guard): replaced if not reasoning: with a dedicated _reasoning_is_blank helper that only triggers fallback when reasoning is None or a blank/whitespace string. A provider that places a legitimately falsy non-string payload (0, False, {}, []) on reasoning is now preserved verbatim. New regression test test_falsy_non_string_reasoning_is_not_overwritten covers the empty-dict case.
  • L505 (CoT export gating): this PR doesn't change the export policy — the existing observation already serialised the top-level reasoning field unconditionally on every assistant message; the PR just mirrors that same policy for the reasoning_content and reasoning_details surfaces so OpenRouter/DeepSeek/LM Studio traffic is no longer silently dropped. A redaction/opt-out gate is a reasonable separate enhancement, but covering all three surfaces uniformly is outside the scope of this regression fix for [Bug]: Langfuse plugin shows reasoning: None for reasoning_content models (DeepSeek/Qwen/LM Studio convention) #29482.

CI test (1)/(6) failure is the unrelated tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resize_escape_is_forwarded pty timeout — no overlap with touched files.

@briandevans briandevans force-pushed the fix/langfuse-reasoning-content-fallback-29482 branch from d953a4b to 6a5a881 Compare May 27, 2026 02:15
Providers that emit chain-of-thought via the `reasoning_content`
convention (LM Studio, Moonshot, Qwen3 thinking, DeepSeek) leave the
top-level `reasoning` field unset and stash the text on
`NormalizedResponse.provider_data["reasoning_content"]`, surfaced via
the property at agent/transports/types.py:115.

`_serialize_assistant_message` only read `reasoning`, so every Langfuse
`LLM call N` observation showed `reasoning: None` for these providers
even though Hermes captured the thinking fine (visible elsewhere in the
runtime via `last_reasoning` in run_agent.py). Fall back to
`reasoning_content` when `reasoning` is empty/missing so the property
is read on the production code path the issue reporter hits.

The other two output-construction sites in the same hook
(`post_llm_call` string path, `post_api_request` summary path) do not
receive an assistant message object and hardcode `reasoning: None`;
both are intentionally unchanged.

Fixes NousResearch#29482
…earch#29482)

Extend `_serialize_assistant_message` with a third fallback that walks
the `reasoning_details` array (OpenRouter's unified format,
`[{type, summary|thinking|content|text}, ...]`).  Without this branch,
OpenRouter-routed traffic still recorded `reasoning: None` even though
the structured payload was present on the NormalizedResponse.

Mirrors `agent.agent_runtime_helpers.extract_reasoning` precedence
exactly: `reasoning` > `reasoning_content` > `reasoning_details`
(per-detail key order `summary > thinking > content > text`), so
the Langfuse observation matches whatever the main agent loop and
`extract_content_or_reasoning` already surface.  Per @swanhtet1992's
review on NousResearch#29486.
Replace the truthy if not reasoning: cascade with an explicit
_reasoning_is_blank helper that only triggers on None or
blank-whitespace strings. The previous guard would have overwritten a
legitimately falsy non-string payload (0, False, {}, [])
that a provider may have placed in reasoning deliberately.

Addresses Copilot review on NousResearch#29486.
@briandevans briandevans force-pushed the fix/langfuse-reasoning-content-fallback-29482 branch from 6a5a881 to 26c34a5 Compare May 29, 2026 04:10
@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to focus the queue on security/file-safety work where civilian merges are landing. Happy to reopen if maintainers want this picked up.

@briandevans briandevans closed this Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Langfuse plugin shows reasoning: None for reasoning_content models (DeepSeek/Qwen/LM Studio convention)

4 participants