Skip to content

fix(response): recover on finish_reason=tool_calls with empty list (closes #99)#108

Merged
PowerCreek merged 1 commit into
mainfrom
response-handler-finish-reason-tools
May 26, 2026
Merged

fix(response): recover on finish_reason=tool_calls with empty list (closes #99)#108
PowerCreek merged 1 commit into
mainfrom
response-handler-finish-reason-tools

Conversation

@PowerCreek

Copy link
Copy Markdown

Summary

Closes #99 (unblocks #100 R12). When the model emits a tool-call-specific finish reason (tool_calls / function_call) but the normalized assistant_message.tool_calls reaches the response handler empty, the previous behavior burned 3 empty-retries before surfacing (empty). This adds a synthetic recovery branch keyed on tool-call finish reasons — same shape as #67's structural recovery but for the tool-call case.

Background

Per #99: hermes-cli reports "returned no content" + retries when an upstream OpenAI-compat response has message.content == "" AND message.tool_calls != []. Investigation traced this to an SDK-normalization gap — when raw message.tool_calls doesn't make it cleanly to the NormalizedResponse.tool_calls shape (or the model emits an unparseable function-call form), the else branch at conversation_loop.py:3493 is entered with content empty + tool_calls falsy, and _truly_empty triggers the retry loop.

Devagentic#295 currently works around this by stripping tools from the request boundary — once that workaround lifts (post-redeploy), this PR's branch catches any remaining edge cases.

Fix shape

_finish_wants_tools = finish_reason in {"tool_calls", "function_call"}
_tool_calls_absent = not (assistant_message.tool_calls or [])
if _finish_wants_tools and _tool_calls_absent and not _already_handled:
    # WARN log + status emit + synthetic recovery prompt

One-shot semantics via _finish_reason_tools_handled flag (mirrors #67's _tools_empty_terminal_handled).

Test plan

  • 9 new tests in test_finish_reason_tools_recovery.py: positive triggers for tool_calls + function_call with None/empty tool_calls; negative for stop/length/content_filter finish reasons; negative when tool_calls populated; one-shot already-handled gate; source-level patch-landed check
  • 28 total green across affected suites — no regression
  • After redeploy: poly-explorer's enumeration prompt ("Name three polynomial transformations.") no longer produces 3 silent retries before (empty)

Composition

Unblocks #100 (R12 run_pipeline / propose_pipeline MCP tools) — that issue explicitly recommended "Stay Backlog until hermes#99 lands". Will pick up #100 next iteration.

Devagentic#295 (tools-strip at boundary) can be lifted once this PR + a redeploy is in place; the client-side fix is now the authoritative path.

…loses #99)

When the model emits a tool-call-specific finish reason
(``tool_calls`` / ``function_call``) but the normalized
``assistant_message.tool_calls`` reaches the response handler empty
(SDK normalization gap, upstream shape issue, etc.), the previous
behavior:

1. Line 3180 ``if assistant_message.tool_calls:`` evaluates falsy
2. Falls into the ``else: # No tool calls`` branch
3. ``final_response = ""`` (empty content)
4. ``_truly_empty`` becomes True; ``finish_reason="tool_calls"``
   is NOT ``"stop"`` so the #67 structural-empty recovery doesn't
   apply
5. ``_empty_content_retries`` burns 3 retries returning the same
   empty shape
6. Falls through to ``"(empty)"`` terminal

This produces the user-visible "returned no content" + retry loop
documented in hermes-agent#99. Devagentic-side workaround NousResearch#295
strips tools at the request boundary to avoid the model emitting
the problematic finish-reason — once that workaround lifts, this
branch catches any remaining edge cases.

## Fix

In the else branch at conversation_loop.py:3493, before the
empty-retry logic, check:

  _finish_wants_tools = finish_reason in {"tool_calls", "function_call"}
  _tool_calls_absent = not (assistant_message.tool_calls or [])

When both hold (and ``_finish_reason_tools_handled`` flag isn't
set — one-shot semantics like the #67 path), surface synthetic
recovery instead of the retry loop:

- Emit WARNING log with the diagnostic shape (finish_reason,
  model, provider, response_id) so operators see the structural
  mismatch
- Emit user-visible status (``⚠️ Model emitted
  finish_reason=tool_calls but no tool_calls parsed — surfacing
  recovery prompt``)
- Append synthetic assistant message + user-level recovery prompt
  that tells the model the shape was broken + how to retry

Composes with #67's structural-empty recovery (different
finish_reason gates them).

## Tests

- 9 new tests in
  ``tests/agent/test_finish_reason_tools_recovery.py``: positive
  for ``tool_calls`` + ``function_call`` finish reasons with None
  / empty-list tool_calls (3); negative for ``stop`` / ``length``
  / ``content_filter`` finish reasons (3); negative when
  tool_calls populated (1); already-handled one-shot (1);
  source-level patch-landed check (1).
- 28 total green across affected suites (new + #67 terminal +
  empty-response-diagnostic). No regression.

## Composition

Unblocks hermes-agent#100 (R12 worker tool surface for
``run_pipeline`` / ``propose_pipeline``) — that issue's
"Recommendation: Stay Backlog until hermes#99 lands" gate
clears with this PR.

Devagentic#295 (tools-strip at boundary) can be lifted once this
PR + a redeploy is in place; the client-side fix is now the
authoritative path.
@PowerCreek PowerCreek merged commit 0b24be0 into main May 26, 2026
@PowerCreek PowerCreek deleted the response-handler-finish-reason-tools branch May 26, 2026 22:45
PowerCreek added a commit that referenced this pull request May 26, 2026
…nd (closes #110) (#111)

Strict OpenAI-compat validators (Groq, Mistral, Fireworks) reject any
unknown property with HTTP 400. Hermes-side bookkeeping markers
(``_thinking_prefill``, ``_empty_recovery_synthetic``,
``_empty_terminal_sentinel``) were leaking through to upstream because
only ``_thinking_prefill`` was individually stripped.

Reproducible on Groq from v0.16.0 (#69's ``_empty_recovery_synthetic``)
but the bug only surfaced widely after v0.17.1 (#108) expanded the
recovery path to ``finish_reason=tool_calls`` empty-list — now the
marker leak fires on every Groq tool-call response.

## Fix

Replace the individual ``api_msg.pop("_thinking_prefill", None)`` in
the API-message preparation loop with a generic leading-underscore
strip:

  for _k in [k for k in api_msg if isinstance(k, str) and k.startswith("_")]:
      api_msg.pop(_k, None)

The leading-underscore convention is stable for hermes-internal
fields; anything that genuinely needs to ride the API has a
documented OpenAI shape (``role`` / ``content`` / ``tool_calls`` /
``tool_call_id`` / ``name`` / ``reasoning_content``). Future internal
markers added with ``_`` prefix get auto-stripped without further
maintenance.

## Tests

- 9 new tests in tests/agent/test_internal_marker_stripping.py:
  strips each known marker (3); strips multiple in one pass (1);
  preserves standard OpenAI shape fields (1); handles
  no-markers and empty-message edge cases (2); non-string keys
  pass through (1); source-level patch-landed check (1).
- 28 total green across affected suites (new + #108 finish-reason
  recovery + #67 structural terminal). No regression.

## Composition

Without this fix, devagentic#295 (tools-strip at boundary) MUST stay
in place — lifting it makes the model emit tool_calls finish reasons
which trigger #108's recovery which now writes API-incompatible
markers. With this fix, the recovery path produces clean OpenAI-shape
messages on the wire.
PowerCreek added a commit that referenced this pull request May 26, 2026
…) (#113)

Wraps devagentic's ``executeWorkflowPipeline`` (NousResearch#258) +
``writeWorkflowPipeline`` (NousResearch#261) mutations as MCP tools so workers
can compose + execute pipelines directly through their tool
surface. R12 of devagentic#210 (flow-aware router umbrella).

## Tools registered

| Tool | Wraps | Returns |
|---|---|---|
| ``run_pipeline(pipeline_id, user_id)`` | ``executeWorkflowPipeline`` | ``PipelineRun {runId, status, nodeOutcomes, ...}`` |
| ``propose_pipeline(name, intent, version, nodes, edges, user_id, lineage_ref?, produced_by?)`` | ``writeWorkflowPipeline`` | ``Doc {id, content, tags, source, ts}`` |

Composes naturally: ``propose_pipeline → doc_id → run_pipeline``.

## Devagentic-side resolver verification

Schemas read directly from devagentic ``api.py`` via gh api:

- ``executeWorkflowPipeline(pipelineId: String!, userId: String!) -> PipelineRun`` (api.py:21628)
- ``writeWorkflowPipeline(name, intent, version, nodes, edges, userId, lineageRef?, producedBy?) -> Doc`` (api.py:21576)

Arg-name camelCase conversion (``pipeline_id`` → ``pipelineId``,
``lineage_ref`` → ``lineageRef``, etc.) matches strawberry's default.

Note: #100's body referenced ``ShadowNodeRef`` as the propose_pipeline
return — the actual mutation returns ``Doc``. Implementing the real
shape.

## Tests

- 10 new client-level tests in
  ``tests/plugins/devagentic-mutations/test_client.py``:
  * ``run_pipeline``: empty-args fail-soft (1), happy path with arg
    threading + nested nodeOutcomes (1), non-dict fail-soft (1).
  * ``propose_pipeline``: empty-args fail-soft (3),
    invalid-version-type fail-soft (3 parametrized), invalid
    nodes/edges-type fail-soft (2), happy path with all args
    threaded (1), optional-args-null-when-omitted (1), non-dict
    fail-soft (1).
- Contract test in
  ``tests/test_mcp_defensive_registrars.py`` extended:
  ``run_pipeline`` + ``propose_pipeline`` added to the required-G2/R12
  tool set (now 17 named devagentic tools).
- 184 total green across affected suites — no regression.

## Composition

Unblocks the worker-side pipeline-economy use case in
devagentic#210 R12. Workers can now author + execute pipelines
without shelling out to ``python api.py``. The recent #99 / #110
fixes (PRs #108 + #111) ensure tool_calls with empty content +
populated tool_calls on Groq/strict-validator paths don't cascade
through the empty-retry loop, so R12 tools are usable from
client-tier dispatches.
PowerCreek added a commit that referenced this pull request May 27, 2026
…#121) (#122)

OpenAI-spec ``tool_call.type`` is required with exactly one valid
value (``"function"``). Mistral-large (and likely others) emit
tool_calls without the ``type`` field. The OpenAI Python SDK's
strict Pydantic validation drops these entries from
``response.choices[0].message.tool_calls`` — hermes then sees an
empty list + ``finish_reason=tool_calls`` and falls into PR #108's
synthetic-recovery path, where the tool the model wanted to call
is lost entirely.

## Fix

Add raw-response fallback in
``ChatCompletionsTransport.normalize_response``:

- When the SDK-parsed ``msg.tool_calls`` is empty AND
  ``finish_reason in {"tool_calls", "function_call"}``, inspect
  the raw response via ``response.model_dump()`` (Pydantic v2),
  attribute walk, or dict shape.
- Iterate the raw tool_calls list, default missing ``type`` to
  ``"function"`` per spec semantics (the field has only ONE valid
  value), and synthesize ``ToolCall`` entries.
- ``ToolCall``'s existing ``type`` property already returns
  ``"function"`` unconditionally, so downstream consumers see the
  recovered call as identical to a normally-parsed one.

Strictly additive: only fires when (a) SDK gave us nothing AND (b)
the finish_reason signals tools. The SDK-happy-path is untouched.

## Implementation detail

Three helpers in ``chat_completions.py``:

- ``_DEFAULT_TOOL_CALL_TYPE = "function"`` constant
- ``_raw_tool_calls_from_response(response)`` — extracts the raw
  tool_calls list via model_dump / dict; fail-soft on any errors
- ``_normalize_raw_tool_call(tc_raw)`` — builds a ``ToolCall``
  from a raw dict entry; returns ``None`` for unrecoverable
  entries (no function.name); re-serializes dict-arguments back
  to JSON string per OpenAI contract

## Tests

- 17 new tests in tests/agent/test_tool_call_type_default.py:
  * 6 ``_normalize_raw_tool_call`` tests: with-type, without-type-
    defaults-function, no-function-block, no-name, not-dict,
    dict-arguments-reserialized, empty-arguments
  * 6 ``_raw_tool_calls_from_response`` tests: model_dump path,
    dict path, no-tool_calls, no-choices, model_dump-raises,
    non-dict-entries-filtered
  * 3 end-to-end normalize_response tests: mistral-shape recovery,
    no-fallback-on-stop-finish_reason, no-double-up-when-sdk-parsed
  * 2 source-level patch-landed checks
- 76 total green across affected suites — no regression on the
  SDK-happy path or the #67/#99/#108 recovery families.

## Composition

Same family of bugs as #99 / #108 / #111 — response handler being
over-strict on the OpenAI spec. This is the next-uncovered case
(tool_calls present in wire response but stripped by SDK before
reaching us).

Devagentic#NN (the companion devagentic-side normalization at the
OAI-shim boundary) becomes redundant defense after this lands —
still harmless to keep.
PowerCreek added a commit that referenced this pull request May 27, 2026
…#124) (#125)

PR #122 / #121 added raw-response fallback for mistral-shaped
tool_calls (type field absent → SDK strips entry → empty
``msg.tool_calls``). The fallback was gated on
``finish_reason in {"tool_calls", "function_call"}`` — assuming the
SDK preserved finish_reason while dropping the tool_call entry.

Field-tested on duplex sandbox container: the SDK actually
normalizes finish_reason to ``stop`` when it strips the type-less
tool_call. The recovery NEVER FIRED on the actual mistral shape;
the structural-empty recovery (#69 / #67) fired instead and
surfaced the noisy "Your previous response was empty" message.

## Fix

Drop the finish_reason gate. Whenever the SDK gives empty
``msg.tool_calls`` but the raw response shape carries tool_calls,
recover them — wire-level evidence is authoritative. Additionally,
rewrite the normalized ``finish_reason`` to ``"tool_calls"`` so
downstream consumers (``conversation_loop.py:3180`` tool branch,
the structural-empty + finish_reason-tools guards) see a
consistent state.

## Tests

- Updated 1 test in
  ``tests/agent/test_tool_call_type_default.py``:
  ``test_normalize_recovers_when_sdk_normalized_finish_reason_to_stop``
  (previously asserted recovery DIDN'T fire on stop, now asserts
  it DOES — matches field-observed SDK behavior + the bug
  symptom). Also asserts finish_reason gets rewritten to
  ``tool_calls`` for downstream consistency.
- 49 total green across affected suites — no regression on the
  SDK-happy path or the #67/#99/#108/#118 recovery families.

## Composition

- #121 / PR #122 — original fix, gated finish_reason too narrowly
- **this PR** — drops the gate; recovers whenever raw has tool_calls
PowerCreek added a commit that referenced this pull request May 28, 2026
…MPTY_RECOVERY (refs #143) (#147)

T2 of the #143 thin-client refactor scope. Devagentic-side cascade
(NousResearch#324) + runaway detector (NousResearch#345-348) + exec-terminus (NousResearch#349-354)
now cover the empty-content recovery layer with full intent /
role / dispatch-trace context. Hermes-side recovery layered on top
caused 3×4 dispatch stacking (#118) + mode confusion + invisible
swallow points (#133 debug funnel).

This patch default-flips: the four hermes-side recovery paths are
short-circuited unless ``HERMES_LEGACY_EMPTY_RECOVERY`` is set to
a truthy value. Legacy users opt-in to keep the pre-T2 behavior.

## Gated paths (all skipped when env unset)

| Path | Source | Loc |
|---|---|---|
| ``_finish_wants_tools`` synthetic recovery | PR #108 / #99 | line ~3618 |
| ``_post_tool_empty_retried`` nudge | pre-existing NousResearch#9400-class | line ~3749 |
| ``_structural_empty`` synthetic recovery | PR #69 / #67 | line ~3880 |
| 3-retry empty-content loop | pre-#67 | line ~3927 |

When env unset (default), empty responses fall through to either
the fallback-chain provider switch (if configured) or the clean
``(empty)`` terminal with ``_empty_terminal_sentinel=True``.

## Preserved through T2 (verified by test_t2_legacy_empty_recovery_gate.py)

- **PR #119 cascade_exhausted short-circuit** — hermes correctly
  deferring to devagentic's sentinel; NOT recovery. Source-level
  test asserts the ``if _cascade_err:`` block is NOT prefixed by
  ``_legacy_recovery_on``.
- **PR #122/#125 raw tool_calls fallback** — pre-recovery wire
  parsing in transports/chat_completions.py. Source-level test
  asserts the helper name doesn't appear in conversation_loop
  (lives elsewhere; untouched).
- **PR #131/#136/#138/#141 diagnostics** — env-gated via
  ``HERMES_DIAG_RAW_CAPTURE`` (#140), independent of this env.

## Operator deploy

Default behavior changes: empty responses surface cleanly (no
synthetic re-prompt). To preserve pre-T2 behavior:

```bash
export HERMES_LEGACY_EMPTY_RECOVERY=1
```

The legacy escape hatch is intended as a temporary safety net
while operators validate the thin-client architecture. Once the
devagentic-side cascade is universally deployed + observed to
cover all empty-content cases, the legacy gate can be removed in
a follow-up (the env var stays as a no-op for backward compat).

## Tests

- 23 new tests in
  ``tests/agent/test_t2_legacy_empty_recovery_gate.py``:
  resolver default-false / empty-false / 6 truthy / 6 falsy-or-
  unknown; source-level gate assertions for each of the four
  recovery branches; preserved-path assertions for cascade_exhausted
  + raw tool_calls fallback location; default-off + opt-in
  resolver round-trip.
- 163 total green across affected suites (T2 + existing empty-
  terminal mirror tests + finish_reason_tools_recovery +
  cascade_exhausted + internal_marker_stripping +
  tool_call_type_default + tool_use_enforcement + T3 + T1 persona
  + doctor persona probe).

## Composition

Per #143 sequencing:
- T3 / PR #145 (clarify default-out) — merged ✓
- T1 / PR #146 (persona default-flip) — merged ✓
- **T2 / this PR (empty-content recovery removal, env-gated)**
- Later: T4-T6 (tool list / iteration cap / summary fallback)

The legacy escape hatch design choice (vs. full removal) lets
operators roll back per-deployment if a previously-recovered edge
case surfaces in the field. Diagnostic from #140
(HERMES_DIAG_RAW_CAPTURE) remains the observability tool to spot
any uncovered case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: response handler ignores message.tool_calls — "returned no content" false-positive on tool-bearing prompts

1 participant