fix(goals): force judge to use tool calls instead of JSON-text replies by teknium1 · Pull Request #23547 · NousResearch/hermes-agent

teknium1 · 2026-05-11T03:49:16Z

Summary

Replaces the goal judge's free-form JSON replies with forced tool calls. Two new tools — submit_checklist (Phase A) and update_checklist (Phase B) — are passed to the auxiliary client with tool_choice pinned to the right tool. Tool-call schemas are enforced server-side; JSON in content is not.
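As a rough illustration of what pinning `tool_choice` to a single tool looks like, here is a minimal sketch using the OpenAI-style function-calling request shape. The schema body, the `text` field on each item, and the helper names `forced_tool_choice`, `build_judge_request`, and `read_items` are assumptions made for this example; the actual `_JUDGE_SUBMIT_CHECKLIST_TOOL_SCHEMA` in `hermes_cli/goals.py` is not reproduced in this PR text and may differ.

```python
import json

# Hypothetical schema: an illustration only, not the real
# _JUDGE_SUBMIT_CHECKLIST_TOOL_SCHEMA from hermes_cli/goals.py.
SUBMIT_CHECKLIST_TOOL = {
    "type": "function",
    "function": {
        "name": "submit_checklist",
        "description": "Submit the checklist items that decompose the goal.",
        "parameters": {
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        # Assumed item shape: one free-text field per item.
                        "properties": {"text": {"type": "string"}},
                        "required": ["text"],
                    },
                },
            },
            "required": ["items"],
        },
    },
}


def forced_tool_choice(tool_name: str) -> dict:
    """Build an OpenAI-style tool_choice that pins the model to one tool."""
    return {"type": "function", "function": {"name": tool_name}}


def build_judge_request(messages: list) -> dict:
    """Assemble request kwargs for a Phase A call with a single forced tool."""
    return {
        "messages": messages,
        "tools": [SUBMIT_CHECKLIST_TOOL],
        "tool_choice": forced_tool_choice("submit_checklist"),
    }


def read_items(arguments: str) -> list:
    """Parse tool-call arguments (a JSON string) and drop empty-text items."""
    parsed = json.loads(arguments)
    texts = []
    for item in parsed.get("items", []):
        text = str(item.get("text", "")).strip() if isinstance(item, dict) else ""
        if text:
            texts.append(text)
    return texts
```

The point of the design is that the provider validates the call against the tool's parameter schema, so the caller reads structured arguments instead of hoping the model's `content` happens to be valid JSON.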

Why

Live test on google/gemini-3-flash-preview hit the consecutive-parse-failures auto-pause: the judge model kept returning empty or non-JSON content, which tripped ⏸ Goal paused — the judge model (3 turns) isn't returning the required JSON verdict. Same failure mode the original v3 design worried about; tool calls are the reliable path.

Mechanics

Phase A — decompose_goal calls the auxiliary client with one tool (submit_checklist) and tool_choice={"function":"submit_checklist"}. Reads the items list directly from tool_call.arguments.items.
Phase B — evaluate_checklist runs a tool loop with two tools (read_file for history inspection, update_checklist for the verdict). Each iteration forces a tool call. The loop exits when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is force-targeted).
Tool-choice fallback — _call_judge_with_tool_choice tries forced→required→auto in order if a provider 400s on a particular shape.
Backstop — if a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch fallback so we never silently lose a checklist on transient hiccups.
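The fallback order and the backstop can be sketched as follows. This is a simplified illustration, not the code in `_call_judge_with_tool_choice`: the `send` callable, the `ProviderRejected` exception, and `items_from_response` are hypothetical stand-ins, and the response is assumed to be an OpenAI-style assistant message dict.

```python
import json


class ProviderRejected(Exception):
    """Hypothetical stand-in for a provider 400 on a tool_choice shape."""


def call_with_tool_choice_fallback(send, tool_name):
    """Try forced, then "required", then "auto"; return the first success."""
    attempts = [
        {"type": "function", "function": {"name": tool_name}},  # forced
        "required",  # any tool, but a tool call is mandatory
        "auto",      # the model decides
    ]
    last_error = None
    for choice in attempts:
        try:
            return send(choice)
        except ProviderRejected as exc:
            last_error = exc  # remember the failure and try the next shape
    assert last_error is not None
    raise last_error


def items_from_response(message):
    """Prefer tool-call arguments; fall back to JSON text in content."""
    tool_calls = message.get("tool_calls") or []
    if tool_calls:
        args = json.loads(tool_calls[0]["function"]["arguments"])
        return args.get("items", [])
    # Backstop: the provider ignored the tools and answered in content.
    try:
        parsed = json.loads(message.get("content") or "")
    except json.JSONDecodeError:
        return []
    return parsed.get("items", []) if isinstance(parsed, dict) else []
```

Only a rejected request moves to the next `tool_choice` shape; a successful response that lacks a tool call is handled separately by the content backstop.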

Validation

| | Before | After |
|---|---|---|
| Live test on gemini-3-flash-preview | Auto-paused at 3 turns with 'judge model isn't returning JSON' | Done in 2 turns, all 11 items completed with item-specific evidence |
| tests/hermes_cli/test_goals.py | 63 passing | 70 passing (7 new) |

Agent log on the live re-run shows `produced 11 checklist items via tool call` (Phase A) and `updates=11 new_items=0` (Phase B): both are clean tool-call paths, and no JSON-content fallback fired.

Files

hermes_cli/goals.py — two new tool schemas (_JUDGE_SUBMIT_CHECKLIST_TOOL_SCHEMA, _JUDGE_UPDATE_CHECKLIST_TOOL_SCHEMA); _extract_tool_call / _serialize_assistant_tool_calls / _call_judge_with_tool_choice helpers; decompose_goal and evaluate_checklist rewritten around forced tool calls; _normalize_update_args (replaces the JSON parser at the apply layer with the same 1-based→0-based conversion + terminal-status filter); system prompts updated to instruct the judge to use the tools instead of replying with JSON.
tests/hermes_cli/test_goals.py — 7 new tests covering both happy path and the JSON-content backstop.
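To make the "1-based→0-based conversion + terminal-status filter" concrete, here is a minimal sketch of that kind of normalization. It is not the real `_normalize_update_args`: the field names `index`, `status`, and `evidence`, the bounds check, and the contents of `TERMINAL_STATUSES` are assumptions chosen for illustration.

```python
# Assumed set of terminal statuses; the real list lives in hermes_cli/goals.py.
TERMINAL_STATUSES = {"completed", "impossible"}


def normalize_updates(raw_updates, checklist_len):
    """Convert judge-facing 1-based indices to 0-based and keep terminal updates."""
    normalized = []
    for upd in raw_updates:
        if not isinstance(upd, dict):
            continue
        status = str(upd.get("status", "")).strip().lower()
        if status not in TERMINAL_STATUSES:
            continue  # non-terminal statuses (e.g. "pending") are ignored
        try:
            index = int(upd.get("index")) - 1  # judge counts from 1
        except (TypeError, ValueError):
            continue
        if not 0 <= index < checklist_len:
            continue  # out-of-range item reference
        normalized.append({
            "index": index,
            "status": status,
            "evidence": str(upd.get("evidence", "")).strip(),
        })
    return normalized
```

For example, under these assumptions an update for item 1 with status `completed` maps to index 0, while a `pending` update or an index beyond the checklist length is dropped.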

In a live test on gemini-3-flash-preview, the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto-pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider.

Two new tools the judge calls:
- submit_checklist(items): Phase A, decompose
- update_checklist(updates, new_items, reason): Phase B, evaluate

Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced).

Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter.

Live verification: the same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, with all 11 items marked completed with item-specific evidence and no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON-parse path.

Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
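The Phase B loop described above (read_file allowed until a read budget runs out, then update_checklist forced) might look roughly like this. It is a sketch under stated assumptions, not the code in `evaluate_checklist`: `run_phase_b`, the `call_judge` callable, the tool-call dict shape, and the default budget of 3 are all hypothetical.

```python
from typing import Callable, Optional


def run_phase_b(
    call_judge: Callable[[list, object, list], Optional[dict]],
    read_file: Callable[[str], str],
    read_budget: int = 3,  # assumed default; the real budget is not stated
) -> Optional[dict]:
    """Loop until update_checklist is called; return its arguments or None."""
    history = []  # tool results fed back to the judge on the next iteration
    reads_left = read_budget
    while True:
        if reads_left > 0:
            # Both tools are offered and some tool call is mandatory.
            tools, choice = ["read_file", "update_checklist"], "required"
        else:
            # Budget exhausted: drop read_file and force the verdict tool.
            tools, choice = ["update_checklist"], "update_checklist"
        call = call_judge(tools, choice, history)
        if call is None:
            return None  # no tool call; the caller may try the JSON backstop
        if call["name"] == "update_checklist":
            return call["arguments"]
        if call["name"] == "read_file" and reads_left > 0:
            reads_left -= 1
            path = call["arguments"].get("path", "")
            history.append({"tool": "read_file", "path": path, "result": read_file(path)})
            continue
        return None  # unexpected tool name, or read_file after it was dropped
```

Every iteration either returns or consumes one unit of the read budget, so the loop runs at most `read_budget + 1` times.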

github-actions · 2026-05-11T03:50:28Z

🔎 Lint report: `fix/goal-judge-tool-calls` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8129 on HEAD, 8129 on base (➖ 0)

🆕 New issues (3):

| Rule | Count |
|---|---|
| `invalid-argument-type` | 3 |

First entries

run_agent.py:13303: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
run_agent.py:13306: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:7160: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`

✅ Fixed issues (3):

| Rule | Count |
|---|---|
| `invalid-argument-type` | 3 |

First entries

run_agent.py:7160: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`
run_agent.py:13306: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:13303: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`

Unchanged: 4267 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)" This reverts commit a63a2b7.
* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)" This reverts commit 4a080b1.
* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)" This reverts commit 404640a.

teknium1 merged commit a63a2b7 into main May 11, 2026
13 of 16 checks passed

teknium1 deleted the fix/goal-judge-tool-calls branch May 11, 2026 03:51

AhmetArif0 mentioned this pull request May 11, 2026

fix(goals): use tool_choice for freeform judge instead of JSON-text reply #23671

Closed

3 tasks

teknium1 mentioned this pull request May 11, 2026

revert: roll back /goal checklist + /subgoal feature stack #23813

Merged
