fix(conversation): structural-empty terminal recovery for stop+no-tool_calls+tools (closes #67)#69
Merged
Merged
Conversation
…l_calls+tools (closes #67)
This was referenced May 24, 2026
Merged
PowerCreek
added a commit
that referenced
this pull request
May 26, 2026
…nd (closes #110) (#111) Strict OpenAI-compat validators (Groq, Mistral, Fireworks) reject any unknown property with HTTP 400. Hermes-side bookkeeping markers (``_thinking_prefill``, ``_empty_recovery_synthetic``, ``_empty_terminal_sentinel``) were leaking through to upstream because only ``_thinking_prefill`` was individually stripped. Reproducible on Groq from v0.16.0 (#69's ``_empty_recovery_synthetic``) but the bug only surfaced widely after v0.17.1 (#108) expanded the recovery path to ``finish_reason=tool_calls`` empty-list — now the marker leak fires on every Groq tool-call response. ## Fix Replace the individual ``api_msg.pop("_thinking_prefill", None)`` in the API-message preparation loop with a generic leading-underscore strip: for _k in [k for k in api_msg if isinstance(k, str) and k.startswith("_")]: api_msg.pop(_k, None) The leading-underscore convention is stable for hermes-internal fields; anything that genuinely needs to ride the API has a documented OpenAI shape (``role`` / ``content`` / ``tool_calls`` / ``tool_call_id`` / ``name`` / ``reasoning_content``). Future internal markers added with ``_`` prefix get auto-stripped without further maintenance. ## Tests - 9 new tests in tests/agent/test_internal_marker_stripping.py: strips each known marker (3); strips multiple in one pass (1); preserves standard OpenAI shape fields (1); handles no-markers and empty-message edge cases (2); non-string keys pass through (1); source-level patch-landed check (1). - 28 total green across affected suites (new + #108 finish-reason recovery + #67 structural terminal). No regression. ## Composition Without this fix, devagentic#295 (tools-strip at boundary) MUST stay in place — lifting it makes the model emit tool_calls finish reasons which trigger #108's recovery which now writes API-incompatible markers. With this fix, the recovery path produces clean OpenAI-shape messages on the wire.
PowerCreek
added a commit
that referenced
this pull request
May 27, 2026
…#124) (#125) PR #122 / #121 added raw-response fallback for mistral-shaped tool_calls (type field absent → SDK strips entry → empty ``msg.tool_calls``). The fallback was gated on ``finish_reason in {"tool_calls", "function_call"}`` — assuming the SDK preserved finish_reason while dropping the tool_call entry. Field-tested on duplex sandbox container: the SDK actually normalizes finish_reason to ``stop`` when it strips the type-less tool_call. The recovery NEVER FIRED on the actual mistral shape; the structural-empty recovery (#69 / #67) fired instead and surfaced the noisy "Your previous response was empty" message. ## Fix Drop the finish_reason gate. Whenever the SDK gives empty ``msg.tool_calls`` but the raw response shape carries tool_calls, recover them — wire-level evidence is authoritative. Additionally, rewrite the normalized ``finish_reason`` to ``"tool_calls"`` so downstream consumers (``conversation_loop.py:3180`` tool branch, the structural-empty + finish_reason-tools guards) see a consistent state. ## Tests - Updated 1 test in ``tests/agent/test_tool_call_type_default.py``: ``test_normalize_recovers_when_sdk_normalized_finish_reason_to_stop`` (previously asserted recovery DIDN'T fire on stop, now asserts it DOES — matches field-observed SDK behavior + the bug symptom). Also asserts finish_reason gets rewritten to ``tool_calls`` for downstream consistency. - 49 total green across affected suites — no regression on the SDK-happy path or the #67/#99/#108/#118 recovery families. ## Composition - #121 / PR #122 — original fix, gated finish_reason too narrowly - **this PR** — drops the gate; recovers whenever raw has tool_calls
PowerCreek
added a commit
that referenced
this pull request
May 28, 2026
…MPTY_RECOVERY (refs #143) (#147) T2 of the #143 thin-client refactor scope. Devagentic-side cascade (NousResearch#324) + runaway detector (NousResearch#345-348) + exec-terminus (NousResearch#349-354) now cover the empty-content recovery layer with full intent / role / dispatch-trace context. Hermes-side recovery layered on top caused 3×4 dispatch stacking (#118) + mode confusion + invisible swallow points (#133 debug funnel). This patch default-flips: the four hermes-side recovery paths are short-circuited unless ``HERMES_LEGACY_EMPTY_RECOVERY`` is set to a truthy value. Legacy users opt-in to keep the pre-T2 behavior. ## Gated paths (all skipped when env unset) | Path | Source | Loc | |---|---|---| | ``_finish_wants_tools`` synthetic recovery | PR #108 / #99 | line ~3618 | | ``_post_tool_empty_retried`` nudge | pre-existing NousResearch#9400-class | line ~3749 | | ``_structural_empty`` synthetic recovery | PR #69 / #67 | line ~3880 | | 3-retry empty-content loop | pre-#67 | line ~3927 | When env unset (default), empty responses fall through to either the fallback-chain provider switch (if configured) or the clean ``(empty)`` terminal with ``_empty_terminal_sentinel=True``. ## Preserved through T2 (verified by test_t2_legacy_empty_recovery_gate.py) - **PR #119 cascade_exhausted short-circuit** — hermes correctly deferring to devagentic's sentinel; NOT recovery. Source-level test asserts the ``if _cascade_err:`` block is NOT prefixed by ``_legacy_recovery_on``. - **PR #122/#125 raw tool_calls fallback** — pre-recovery wire parsing in transports/chat_completions.py. Source-level test asserts the helper name doesn't appear in conversation_loop (lives elsewhere; untouched). - **PR #131/#136/#138/#141 diagnostics** — env-gated via ``HERMES_DIAG_RAW_CAPTURE`` (#140), independent of this env. ## Operator deploy Default behavior changes: empty responses surface cleanly (no synthetic re-prompt). To preserve pre-T2 behavior: ```bash export HERMES_LEGACY_EMPTY_RECOVERY=1 ``` The legacy escape hatch is intended as a temporary safety net while operators validate the thin-client architecture. Once the devagentic-side cascade is universally deployed + observed to cover all empty-content cases, the legacy gate can be removed in a follow-up (the env var stays as a no-op for backward compat). ## Tests - 23 new tests in ``tests/agent/test_t2_legacy_empty_recovery_gate.py``: resolver default-false / empty-false / 6 truthy / 6 falsy-or- unknown; source-level gate assertions for each of the four recovery branches; preserved-path assertions for cascade_exhausted + raw tool_calls fallback location; default-off + opt-in resolver round-trip. - 163 total green across affected suites (T2 + existing empty- terminal mirror tests + finish_reason_tools_recovery + cascade_exhausted + internal_marker_stripping + tool_call_type_default + tool_use_enforcement + T3 + T1 persona + doctor persona probe). ## Composition Per #143 sequencing: - T3 / PR #145 (clarify default-out) — merged ✓ - T1 / PR #146 (persona default-flip) — merged ✓ - **T2 / this PR (empty-content recovery removal, env-gated)** - Later: T4-T6 (tool list / iteration cap / summary fallback) The legacy escape hatch design choice (vs. full removal) lets operators roll back per-deployment if a previously-recovered edge case surfaces in the field. Diagnostic from #140 (HERMES_DIAG_RAW_CAPTURE) remains the observability tool to spot any uncovered case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #67. Based on diagnostic data captured by PR #68 from a live polynomial-explorer failure:
finish_reason=stopconfirms the model is choosing to return nothing — not a content filter, not a length cut, not a tool call. The MCP tool schemas in the request overwhelmmistral-large(and likely other small models) and it elects to emit nothing. Retrying with the same input produces the same empty across all 3 attempts.What this does
Short-circuits the empty-content retry loop when ALL of:
_truly_empty(no content, no reasoning)finish_reason == "stop"prior_was_toolis False (separate_post_tool_empty_retriedpath handles the post-tool case)agent.toolsis non-empty (tools were attached to the request)_tools_empty_terminal_handledflag not yet set (one-shot per session)When triggered, instead of wasting 3 retries:
"(empty — model returned no content despite tools being attached; finish_reason=stop)"messagecontinues — next loop iteration sends the productive recovery prompt instead of re-trying the same inputSets
_tools_empty_terminal_handled = Trueso recovery itself can't loop. If the recovery prompt also produces empty, the existing 3-retry-then-fallback chain takes over.Why this shape over retry-without-tools
Both options the orchestrator proposed have merit. This PR ships (a) — terminal-with-synthetic — for these reasons:
_post_tool_empty_retriedpath at line ~3520 uses exactly this synthetic-assistant-then-synthetic-user shape. New trigger is symmetric.Test plan
tests/agent/test_empty_response_terminal.pycover the guard expression:tests/agent/test_empty_response_diagnostic.py(PR instrument(conversation): enrich empty-response WARN with response-shape diagnostics (#67 partial) #68) all 9 still pass — no regression to the diagnostic plumbing.Deploy
Hermes-side change → needs the same container rebuild as PR #68 to land on devbox (G6/#66). Operator restart of the devagentic service.py alone won't pick this up.
Companion PR
devagentic#219 — server-side
[empty-content]WARN. Captures the same failure pattern from the server vantage; ships in any service.py restart (no container rebuild needed). Both PRs are complementary — one captures from the wire, the other handles the recovery.