fix(goals): force judge to use tool calls instead of JSON-text replies#23547
Merged
Conversation
Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
Contributor
🔎 Lint report:
|
| Rule | Count |
|---|---|
invalid-argument-type |
3 |
First entries
run_agent.py:13303: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
run_agent.py:13306: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:7160: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
✅ Fixed issues (3):
| Rule | Count |
|---|---|
invalid-argument-type |
3 |
First entries
run_agent.py:7160: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`
run_agent.py:13306: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:13303: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`
Unchanged: 4267 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
3 tasks
teknium1
added a commit
that referenced
this pull request
May 11, 2026
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)" This reverts commit a63a2b7. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)" This reverts commit 4a080b1. * Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)" This reverts commit 404640a.
rmulligan
pushed a commit
to rmulligan/hermes-agent
that referenced
this pull request
May 11, 2026
NousResearch#23547) Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
rmulligan
pushed a commit
to rmulligan/hermes-agent
that referenced
this pull request
May 11, 2026
…rch#23813) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)" This reverts commit a63a2b7. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)" This reverts commit 4a080b1. * Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)" This reverts commit 404640a.
JinyuID
pushed a commit
to JinyuID/hermes-agent
that referenced
this pull request
May 11, 2026
NousResearch#23547) Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
JinyuID
pushed a commit
to JinyuID/hermes-agent
that referenced
this pull request
May 11, 2026
…rch#23813) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)" This reverts commit d7d4d91. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)" This reverts commit 0398de7. * Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)" This reverts commit b968856.
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
NousResearch#23547) Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…rch#23813) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)" This reverts commit a63a2b7. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)" This reverts commit 4a080b1. * Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)" This reverts commit 404640a.
jsboige
pushed a commit
to jsboige/hermes-agent
that referenced
this pull request
May 14, 2026
NousResearch#23547) Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
jsboige
pushed a commit
to jsboige/hermes-agent
that referenced
this pull request
May 14, 2026
…rch#23813) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)" This reverts commit 4e224c0. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)" This reverts commit f7865f8. * Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)" This reverts commit be5fc05.
AlexFoxD
pushed a commit
to AlexFoxD/hermes-agent
that referenced
this pull request
May 21, 2026
NousResearch#23547) Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
AlexFoxD
pushed a commit
to AlexFoxD/hermes-agent
that referenced
this pull request
May 21, 2026
…rch#23813) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)" This reverts commit a63a2b7. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)" This reverts commit 4a080b1. * Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)" This reverts commit 404640a.
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
NousResearch#23547) Live-tested on gemini-3-flash-preview the judge kept returning empty or non-JSON content, tripping the consecutive-parse-failures auto- pause. Free-form JSON output is hopeful; tool-call schemas are enforced server-side by virtually every modern provider. Two new tools the judge calls: - submit_checklist(items) — Phase A, decompose - update_checklist(updates, new_items, reason) — Phase B, evaluate Both phases now call the auxiliary client with tool_choice forcing the right tool. read_file remains for Phase B history inspection, with the loop exiting only when update_checklist is called or the read budget is exhausted (at which point read_file is dropped from the toolbox and update_checklist is forced). Robustness: - _call_judge_with_tool_choice falls back tool_choice forced→required→ auto if the provider rejects a particular shape. - If a fully-broken provider still returns content instead of a tool call, the legacy JSON-text parsers stay around as a last-ditch backstop so we never silently lose a checklist. - _normalize_update_args replaces the JSON parser for the apply layer; same 1-based→0-based conversion + terminal-status filter. Live verification: same fizzbuzz goal that was hitting 'judge model returned unparseable output 3 turns in a row' before now terminates in 2 turns, all 11 items marked completed with item-specific evidence, no auto-pause. Agent log shows 'produced 11 checklist items via tool call' instead of the JSON- parse path. Tests: 7 new cases for the tool-call path (Phase A success, Phase B update only, Phase B read_file→update, JSON-content backstop, empty-text item dropping, non-terminal status filter).
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…rch#23813) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (NousResearch#23547)" This reverts commit a63a2b7. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (NousResearch#23530)" This reverts commit 4a080b1. * Revert "feat(goals): /goal checklist + /subgoal user controls (NousResearch#23456)" This reverts commit 404640a.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the goal judge's free-form JSON replies with forced tool calls. Two new tools —
submit_checklist(Phase A) andupdate_checklist(Phase B) — are passed to the auxiliary client withtool_choicepinned to the right tool. Tool-call schemas are enforced server-side; JSON in content is not.Why
Live test on
google/gemini-3-flash-previewhit the consecutive-parse-failures auto-pause: the judge model kept returning empty or non-JSON content, which tripped⏸ Goal paused — the judge model (3 turns) isn't returning the required JSON verdict. Same failure mode the original v3 design worried about; tool calls are the reliable path.Mechanics
decompose_goalcalls the auxiliary client with one tool (submit_checklist) andtool_choice={"function":"submit_checklist"}. Reads the items list directly fromtool_call.arguments.items.evaluate_checklistruns a tool loop with two tools (read_filefor history inspection,update_checklistfor the verdict). Each iteration forces a tool call. The loop exits whenupdate_checklistis called or the read budget is exhausted (at which pointread_fileis dropped from the toolbox andupdate_checklistis force-targeted)._call_judge_with_tool_choicetries forced→required→auto in order if a provider 400s on a particular shape.Validation
Agent log on the live re-run shows
produced 11 checklist items via tool call(Phase A) andupdates=11 new_items=0(Phase B) — both clean tool-call paths, no JSON-content fallback fired.Files
hermes_cli/goals.py— two new tool schemas (_JUDGE_SUBMIT_CHECKLIST_TOOL_SCHEMA,_JUDGE_UPDATE_CHECKLIST_TOOL_SCHEMA);_extract_tool_call/_serialize_assistant_tool_calls/_call_judge_with_tool_choicehelpers;decompose_goalandevaluate_checklistrewritten around forced tool calls;_normalize_update_args(replaces the JSON parser at the apply layer with the same 1-based→0-based conversion + terminal-status filter); system prompts updated to instruct the judge to use the tools instead of replying with JSON.tests/hermes_cli/test_goals.py— 7 new tests covering both happy path and the JSON-content backstop.