fix(goals): salvage PR #19160 + auto-pause on consecutive judge parse failures #21576
Merged
Route goal status notices through the platform adapter send API and register post-delivery callbacks so completed-goal notices appear after the final assistant response. Also cancel queued synthetic goal continuations on /goal pause and /goal clear while preserving normal queued user messages.
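A minimal sketch of the cancellation behaviour described above, assuming a simple in-memory queue. `QueuedMessage`, its `synthetic` flag, and `cancel_goal_continuations` are hypothetical names for illustration, not the gateway's actual types:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedMessage:
    """Hypothetical queue entry; `synthetic` marks a goal continuation."""
    text: str
    synthetic: bool = False


def cancel_goal_continuations(queue: deque) -> int:
    """Drop synthetic goal continuations in place and return how many were removed.

    Normal user messages keep their original order.
    """
    kept = [m for m in queue if not m.synthetic]
    removed = len(queue) - len(kept)
    queue.clear()
    queue.extend(kept)
    return removed


pending = deque([
    QueuedMessage("what's the status?"),
    QueuedMessage("[goal continuation]", synthetic=True),
    QueuedMessage("also check the logs"),
])
removed = cancel_goal_continuations(pending)
remaining = [m.text for m in pending]
```

Filtering in place rather than replacing the queue object means any other holder of the same queue reference sees the cancellation too.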
Weak judge models (e.g. deepseek-v4-flash) return empty strings or prose
when asked for the strict {done, reason} JSON verdict. The old code
failed-open to continue on every such turn, burning the entire turn
budget with log lines like
    judge returned empty response
    judge reply was not JSON: "Let me analyze whether the goal..."
and /goal clear could not stop it mid-loop without /stop.
After N=3 consecutive *parse* failures (transport/API errors don't
count — those are transient), the loop auto-pauses and prints:
    ⏸ Goal paused — the judge model (3 turns) isn't returning the
    required JSON verdict. Route the judge to a stricter model in
    ~/.hermes/config.yaml:
      auxiliary:
        goal_judge:
          provider: openrouter
          model: google/gemini-3-flash-preview
    Then /goal resume to continue.
The counter resets on any turn that is not a parse failure (a valid
"done"/"continue" verdict, or an API error) and persists across
GoalManager reloads so cross-session resumes carry the correct state.
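The parse-failure counting can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `parse_judge_reply`, `GoalSketch`, and `record_turn` are hypothetical stand-ins, a reply without a `done` key is assumed to count as a parse failure, and API/transport errors are not modelled.

```python
import json
from dataclasses import dataclass

MAX_CONSECUTIVE_PARSE_FAILURES = 3  # threshold value taken from the PR text


def parse_judge_reply(raw):
    """Return (verdict, reason, parse_failed); fails open to "continue"."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Empty string or prose instead of the {done, reason} JSON verdict.
        return "continue", "judge reply was not JSON", True
    if not isinstance(data, dict) or "done" not in data:
        # Assumption: valid JSON without a `done` key is also unusable.
        return "continue", "judge reply missing 'done'", True
    verdict = "done" if data["done"] else "continue"
    return verdict, str(data.get("reason", "")), False


@dataclass
class GoalSketch:
    """Hypothetical stand-in for the persisted goal state."""
    consecutive_parse_failures: int = 0
    paused: bool = False


def record_turn(state: GoalSketch, raw_reply) -> str:
    """Update the counter for one judge reply and return the outcome."""
    verdict, _reason, parse_failed = parse_judge_reply(raw_reply)
    if parse_failed:
        state.consecutive_parse_failures += 1
        if state.consecutive_parse_failures >= MAX_CONSECUTIVE_PARSE_FAILURES:
            state.paused = True
            return "paused"
    else:
        state.consecutive_parse_failures = 0  # any usable reply resets it
    return verdict


state = GoalSketch()
outcomes = [
    record_turn(state, reply)
    for reply in ["", "Let me analyze whether the goal...", ""]
]
```

Keeping the counter on the state object, rather than in a local variable of the loop, is what would let it survive a reload if that object is persisted.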
Also fixes test_goal_verdict_send.py sharing a hardcoded session_id
across tests — the shared id only worked because the previous
_post_turn_goal_continuation was a never-awaited coroutine. Now that
PR #19160 made it properly awaited, the xdist test-leakage bug
surfaced. Each test gets a unique session_id via uuid suffix.
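One way to give each test its own id is a small helper like the one below. The helper name and the `goal-sess` prefix are assumptions for illustration; the actual tests may build the id differently.

```python
import uuid


def make_session_id(prefix: str = "goal-sess") -> str:
    """Return a session id that is unique per call (hypothetical helper)."""
    return f"{prefix}-{uuid.uuid4().hex}"


first = make_session_id()
second = make_session_id()
```

Because every call yields a different suffix, state keyed by `session_id` in one test cannot be picked up by another test running in the same xdist worker.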
🔎 Lint report:
| Rule | Count |
|---|---|
| unresolved-attribute | 3 |
| invalid-parameter-default | 1 |
| unresolved-import | 1 |
First entries
tests/gateway/test_goal_status_notice.py:122: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_pending_messages` on type `FakeAdapter`
tests/gateway/test_goal_verdict_send.py:64: [invalid-parameter-default] invalid-parameter-default: Default value of type `None` is not assignable to annotated parameter type `str`
tests/gateway/test_goal_status_notice.py:5: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/hermes_cli/test_goals.py:495: [unresolved-attribute] unresolved-attribute: Attribute `consecutive_parse_failures` is not defined on `None` in union `GoalState | None`
tests/gateway/test_goal_status_notice.py:146: [unresolved-attribute] unresolved-attribute: Object of type `FakeAdapter` has no attribute `_pending_messages`
✅ Fixed issues: none
Unchanged: 4028 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
Salvages PR #19160 (@liquidchen) and adds an auto-pause guard so weak judge models can't silently burn the entire goal turn budget.
Changes
- Uses `adapter.send()` instead of a non-existent `send_message()`, defers the notice until after the main response is delivered, and cancels queued synthetic goal continuations on `/goal pause` and `/goal clear`
- `GoalState.consecutive_parse_failures` counter + `DEFAULT_MAX_CONSECUTIVE_PARSE_FAILURES=3` threshold
- `_parse_judge_response()` / `judge_goal()` now return a 3-tuple `(verdict, reason, parse_failed)`; `parse_failed` is True only on empty/non-JSON judge output, False on API/transport errors (those stay transient, fail-open)
- `evaluate_after_turn()` auto-pauses after 3 consecutive parse failures with a config-pointer message
- `test_goal_verdict_send.py` was sharing a hardcoded `session_id="goal-sess-1"` across tests. It only worked before because `_post_turn_goal_continuation` was a never-awaited coroutine: the coroutine never ran, so cross-test state never leaked. PR #19160's async conversion made the coroutine actually run, surfacing the latent xdist test-leakage bug. Each test now gets a unique `session_id` via `uuid.uuid4()`.
- ytchen0719@gmail.com → liquidchen
Why
Discord user barteq reported an infinite `/goal` loop when using `deepseek-v4-flash` as the judge. The judge returned empty strings or prose such as "Let me analyze whether the goal..." instead of the JSON verdict. The loop failed open to "continue" on every such turn, burning the entire 20-turn budget.
Gille already has a config-side workaround (route `auxiliary.goal_judge` to `google/gemini-3-flash-preview`), which is the right answer long-term, but this PR makes the agent detect the failure mode and tell the user how to fix it, instead of running silently until the budget is exhausted.
PR #19160 (same session) was also needed because `/goal clear` didn't stop an already-queued synthetic continuation: a related failure mode with the same shape.
Validation
state_meta
E2E tested via `execute_code` with real imports from the worktree, isolated `HERMES_HOME`, and mock judge clients returning the exact shapes barteq reported.
Credit
Closes #19160.