Skip to content

fix(goals): force judge to use tool calls instead of JSON-text replies#23547

Merged
teknium1 merged 1 commit into
mainfrom
fix/goal-judge-tool-calls
May 11, 2026
Merged

fix(goals): force judge to use tool calls instead of JSON-text replies#23547
teknium1 merged 1 commit into
mainfrom
fix/goal-judge-tool-calls

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Replaces the goal judge's free-form JSON replies with forced tool calls. Two new tools — submit_checklist (Phase A) and update_checklist (Phase B) — are passed to the auxiliary client with tool_choice pinned to the right tool. Tool-call schemas are enforced server-side; JSON in content is not.

Why

Live test on google/gemini-3-flash-preview hit the consecutive-parse-failures auto-pause: the judge model kept returning empty or non-JSON content, which tripped ⏸ Goal paused — the judge model (3 turns) isn't returning the required JSON verdict. Same failure mode the original v3 design worried about; tool calls are the reliable path.

Mechanics

  • Phase Adecompose_goal calls the auxiliary client with one tool (submit_checklist) and tool_choice={"function":"submit_checklist"}. Reads the items list directly from tool_call.arguments.items.
  • Phase Bevaluate_checklist runs a tool loop with two tools (read_file for history inspection, update_checklist for the verdict). Each iteration forces a tool call. The loop exits when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is force-targeted).
  • Tool-choice fallback_call_judge_with_tool_choice tries forced→required→auto in order if a provider 400s on a particular shape.
  • Backstop — if a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch fallback so we never silently lose a checklist on transient hiccups.

Validation

Before After
Live test on gemini-3-flash-preview Auto-paused at 3 turns with 'judge model isn't returning JSON' Done in 2 turns, all 11 items completed with item-specific evidence
tests/hermes_cli/test_goals.py 63 passing 70 passing (7 new)

Agent log on the live re-run shows produced 11 checklist items via tool call (Phase A) and updates=11 new_items=0 (Phase B) — both clean tool-call paths, no JSON-content fallback fired.

Files

  • hermes_cli/goals.py — two new tool schemas (_JUDGE_SUBMIT_CHECKLIST_TOOL_SCHEMA, _JUDGE_UPDATE_CHECKLIST_TOOL_SCHEMA); _extract_tool_call / _serialize_assistant_tool_calls / _call_judge_with_tool_choice helpers; decompose_goal and evaluate_checklist rewritten around forced tool calls; _normalize_update_args (replaces the JSON parser at the apply layer with the same 1-based→0-based conversion + terminal-status filter); system prompts updated to instruct the judge to use the tools instead of replying with JSON.
  • tests/hermes_cli/test_goals.py — 7 new tests covering both happy path and the JSON-content backstop.

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/goal-judge-tool-calls vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8129 on HEAD, 8129 on base (➖ 0)

🆕 New issues (3):

Rule Count
invalid-argument-type 3
First entries
run_agent.py:13303: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
run_agent.py:13306: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:7160: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`

✅ Fixed issues (3):

Rule Count
invalid-argument-type 3
First entries
run_agent.py:7160: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`
run_agent.py:13306: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:13303: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`

Unchanged: 4267 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@teknium1 teknium1 merged commit a63a2b7 into main May 11, 2026
13 of 16 checks passed
@teknium1 teknium1 deleted the fix/goal-judge-tool-calls branch May 11, 2026 03:51
teknium1 added a commit that referenced this pull request May 11, 2026
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)"

This reverts commit a63a2b7.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)"

This reverts commit 4a080b1.

* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)"

This reverts commit 404640a.
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
NousResearch#23547)

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
…rch#23813)

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)"

This reverts commit a63a2b7.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)"

This reverts commit 4a080b1.

* Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)"

This reverts commit 404640a.
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
NousResearch#23547)

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
…rch#23813)

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)"

This reverts commit d7d4d91.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)"

This reverts commit 0398de7.

* Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)"

This reverts commit b968856.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
NousResearch#23547)

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…rch#23813)

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)"

This reverts commit a63a2b7.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)"

This reverts commit 4a080b1.

* Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)"

This reverts commit 404640a.
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
NousResearch#23547)

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…rch#23813)

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)"

This reverts commit 4e224c0.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)"

This reverts commit f7865f8.

* Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)"

This reverts commit be5fc05.
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
NousResearch#23547)

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
…rch#23813)

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)"

This reverts commit a63a2b7.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)"

This reverts commit 4a080b1.

* Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)"

This reverts commit 404640a.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
NousResearch#23547)

Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.

Two new tools the judge calls:

  - submit_checklist(items)  — Phase A, decompose
  - update_checklist(updates, new_items, reason) — Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
  auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
  call, the legacy JSON-text parsers stay around as a last-ditch
  backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
  layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…rch#23813)

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)"

This reverts commit a63a2b7.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)"

This reverts commit 4a080b1.

* Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)"

This reverts commit 404640a.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant