fix(goals): salvage PR #19160 + auto-pause on consecutive judge parse failures by teknium1 · Pull Request #21576 · NousResearch/hermes-agent

teknium1 · 2026-05-08T00:20:58Z

Salvages PR #19160 (@liquidchen) and adds an auto-pause guard so weak judge models can't silently burn the entire goal turn budget.

Changes

cherry-picked a9f5f81 from PR #19160, "fix(gateway): defer goal status notices until after response delivery" (contributor authorship preserved via rebase-merge): routes goal status notices through adapter.send() instead of a non-existent send_message(), defers the notice until after the main response is delivered, and cancels queued synthetic goal continuations on /goal pause and /goal clear
new: GoalState.consecutive_parse_failures counter + DEFAULT_MAX_CONSECUTIVE_PARSE_FAILURES=3 threshold
_parse_judge_response() / judge_goal() now return a 3-tuple (verdict, reason, parse_failed) — parse_failed is True only on empty/non-JSON judge output, False on API/transport errors (those stay transient, fail-open)

evaluate_after_turn() auto-pauses after 3 consecutive parse failures with a config-pointer message:

⏸ Goal paused — the judge model (3 turns) isn't returning the required JSON verdict.
Route the judge to a stricter model in ~/.hermes/config.yaml:
  auxiliary:
    goal_judge:
      provider: openrouter
      model: google/gemini-3-flash-preview
Then /goal resume to continue.

test fix: test_goal_verdict_send.py was sharing a hardcoded session_id="goal-sess-1" across tests. It only worked before because _post_turn_goal_continuation was a never-awaited coroutine: the coroutine never ran, so cross-test state never leaked. The async conversion in PR #19160 made the coroutine actually run, surfacing the latent xdist test-leakage bug. Each test now gets a unique session_id via uuid.uuid4().
AUTHOR_MAP entry: ytchen0719@gmail.com → liquidchen
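
The 3-tuple contract described in the Changes list can be illustrated with a minimal sketch. This is not the code from the PR: the function name `parse_judge_response_sketch`, the exact reason strings, and the handling of valid JSON that lacks a `done` field are assumptions made for illustration. Only the shape `(verdict, reason, parse_failed)` and the empty/non-JSON rule come from the description above.

```python
import json
from typing import Optional, Tuple


def parse_judge_response_sketch(raw: Optional[str]) -> Tuple[str, str, bool]:
    """Return (verdict, reason, parse_failed) for a raw judge reply.

    Hypothetical stand-in for _parse_judge_response(); the real
    implementation may differ.
    """
    text = (raw or "").strip()
    if not text:
        # Empty reply: fail open to "continue", but flag the parse failure.
        return "continue", "judge returned empty response", True
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Prose instead of JSON: same fail-open verdict, flagged as well.
        return "continue", f"judge reply was not JSON: {text[:60]!r}", True
    if not isinstance(data, dict) or "done" not in data:
        # Assumption: JSON without a usable verdict also counts as a parse failure.
        return "continue", "judge JSON lacked a 'done' field", True
    verdict = "done" if data["done"] else "continue"
    return verdict, str(data.get("reason", "")), False
```

Under this sketch, API or transport errors would be handled by the caller and reported with `parse_failed=False`, which keeps them out of the consecutive-failure count.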

Why

Discord user barteq reported an infinite /goal loop when using deepseek-v4-flash as the judge. The judge returned output like:

judge returned empty response
judge reply was not JSON: "Let me analyze whether the goal is fully satisfied..."

The loop failed open to "continue" on every such turn, burning the entire 20-turn budget.

Gille already has a config-side workaround (route auxiliary.goal_judge to google/gemini-3-flash-preview), which is the right answer long-term — but this PR makes the agent detect the failure mode and tell the user HOW to fix it, instead of running silently until the budget is exhausted.

PR #19160 (same session) was also needed because /goal clear didn't stop an already-queued synthetic continuation — related failure mode, same shape.

Validation

| Scenario | Before | After |
|---|---|---|
| Targeted tests (tests/hermes_cli/test_goals.py + tests/gateway/test_goal_status_notice.py + tests/gateway/test_goal_verdict_send.py) | 3 failed (async conversion) | 43/43 passing |
| barteq scenario: judge returns prose for 3 turns | runs all 20 turns | pauses at turn 3 with config-pointer |
| Transient ConnectionError for 10 turns | n/a | does NOT trip auto-pause (stays active) |
| 2 bad + 1 good + 3 bad | n/a | pauses at turn 6 (counter reset by good reply) |
| Counter durability across GoalManager reload | n/a | persisted via `state_meta` |
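
The counter behaviour behind the scenarios above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the project's `evaluate_after_turn()`: `GoalStateSketch`, `evaluate_after_turn_sketch`, and `first_paused_turn` are hypothetical names, and the real function takes more context than a single boolean. Persistence via `state_meta` is not modelled.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Mirrors DEFAULT_MAX_CONSECUTIVE_PARSE_FAILURES=3 from the PR description.
MAX_CONSECUTIVE_PARSE_FAILURES = 3


@dataclass
class GoalStateSketch:
    """Hypothetical, reduced stand-in for GoalState."""
    status: str = "active"
    consecutive_parse_failures: int = 0


def evaluate_after_turn_sketch(state: GoalStateSketch, parse_failed: bool) -> str:
    """Update the counter for one turn and pause once the threshold is hit."""
    if parse_failed:
        state.consecutive_parse_failures += 1
    else:
        # Any usable reply, or a transient API error, resets the streak.
        state.consecutive_parse_failures = 0
    if state.consecutive_parse_failures >= MAX_CONSECUTIVE_PARSE_FAILURES:
        state.status = "paused"
    return state.status


def first_paused_turn(outcomes: Iterable[bool]) -> Optional[int]:
    """Return the 1-based turn at which the goal pauses, or None if it never does."""
    state = GoalStateSketch()
    for turn, parse_failed in enumerate(outcomes, start=1):
        if evaluate_after_turn_sketch(state, parse_failed) == "paused":
            return turn
    return None
```

Feeding it three parse failures pauses at turn 3, ten transient errors (modelled as `parse_failed=False`) never pause, and the "2 bad + 1 good + 3 bad" sequence pauses at turn 6 because the good reply resets the streak.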

E2E tested via execute_code with real imports from the worktree, isolated HERMES_HOME, and mock judge clients returning the exact shapes barteq reported.

Credit

@liquidchen: cherry-pick of PR #19160, "fix(gateway): defer goal status notices until after response delivery" (gateway send()/defer/continuation cleanup + 3 regression tests)
Teknium: Option B direction (auto-pause with config pointer)

Closes #19160.

Route goal status notices through the platform adapter send API and register post-delivery callbacks so completed-goal notices appear after the final assistant response. Also cancel queued synthetic goal continuations on /goal pause and /goal clear while preserving normal queued user messages.
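
The ordering and cancellation behaviour in this commit message can be sketched as follows. Everything here is illustrative: `FakeAdapterSketch`, the `send(chat_id, text)` signature, `deliver_turn`, `cancel_synthetic`, and the `synthetic` flag on queued messages are assumptions, not the gateway's actual API.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Deque, Dict, List, Tuple


class FakeAdapterSketch:
    """Hypothetical platform adapter exposing an async send()."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    async def send(self, chat_id: str, text: str) -> None:
        self.sent.append((chat_id, text))


async def deliver_turn(
    adapter: FakeAdapterSketch,
    chat_id: str,
    response: str,
    post_delivery: List[Callable[[], Awaitable[None]]],
) -> None:
    """Send the main response first, then run the deferred callbacks."""
    await adapter.send(chat_id, response)
    for callback in post_delivery:
        await callback()


def cancel_synthetic(queue: Deque[Dict]) -> None:
    """Drop queued synthetic goal continuations, keep real user messages."""
    kept = [msg for msg in queue if not msg.get("synthetic")]
    queue.clear()
    queue.extend(kept)


async def demo() -> List[Tuple[str, str]]:
    adapter = FakeAdapterSketch()

    async def goal_notice() -> None:
        await adapter.send("chat-1", "✓ Goal achieved")

    await deliver_turn(adapter, "chat-1", "final answer", [goal_notice])
    return adapter.sent
```

Running `asyncio.run(demo())` records the final answer before the goal notice, which is the ordering the commit aims for.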

Weak judge models (e.g. deepseek-v4-flash) return empty strings or prose when asked for the strict {done, reason} JSON verdict. The old code failed-open to continue on every such turn, burning the entire turn budget with log lines like

judge returned empty response
judge reply was not JSON: "Let me analyze whether the goal..."

and /goal clear could not stop it mid-loop without /stop.

After N=3 consecutive *parse* failures (transport/API errors don't count; those are transient), the loop auto-pauses and prints:

⏸ Goal paused — the judge model (3 turns) isn't returning the required JSON verdict.
Route the judge to a stricter model in ~/.hermes/config.yaml:
  auxiliary:
    goal_judge:
      provider: openrouter
      model: google/gemini-3-flash-preview
Then /goal resume to continue.

The counter resets on any usable reply (both "done"/"continue" and API errors) and persists across GoalManager reloads so cross-session resumes carry the correct state.

Also fixes test_goal_verdict_send.py sharing a hardcoded session_id across tests. The shared id only worked because the previous _post_turn_goal_continuation was a never-awaited coroutine. Now that PR #19160 made it properly awaited, the xdist test-leakage bug surfaced. Each test gets a unique session_id via uuid suffix.
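
The test-isolation fix mentioned at the end of this commit message amounts to generating a fresh id per test. A minimal sketch, assuming a helper named `make_session_id` (hypothetical, as is the `goal-sess` prefix format) rather than whatever the test file actually uses:

```python
import uuid


def make_session_id(prefix: str = "goal-sess") -> str:
    """Build a session id that is unique per call, so tests cannot share state."""
    return f"{prefix}-{uuid.uuid4().hex}"
```

Each test would call the helper instead of reusing a literal such as "goal-sess-1", so state written by one test is never visible under another test's id, including when xdist schedules them in the same worker.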

github-actions · 2026-05-08T00:22:01Z

🔎 Lint report: `hermes/hermes-1912be3d` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7683 on HEAD, 7670 on base (🆕 +13)

🆕 New issues (5):

| Rule | Count |
|---|---|
| `unresolved-attribute` | 3 |
| `invalid-parameter-default` | 1 |
| `unresolved-import` | 1 |

First entries

tests/gateway/test_goal_status_notice.py:122: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_pending_messages` on type `FakeAdapter`
tests/gateway/test_goal_verdict_send.py:64: [invalid-parameter-default] invalid-parameter-default: Default value of type `None` is not assignable to annotated parameter type `str`
tests/gateway/test_goal_status_notice.py:5: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/hermes_cli/test_goals.py:495: [unresolved-attribute] unresolved-attribute: Attribute `consecutive_parse_failures` is not defined on `None` in union `GoalState | None`
tests/gateway/test_goal_status_notice.py:146: [unresolved-attribute] unresolved-attribute: Object of type `FakeAdapter` has no attribute `_pending_messages`
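
Two of these diagnostics follow common patterns that are usually silenced with small annotation changes. The sketch below shows the general idea only; it is not the code at the reported lines, and `GoalStateSketch`, `make_runner`, and `failure_count` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoalStateSketch:
    """Hypothetical stand-in for GoalState."""
    consecutive_parse_failures: int = 0


# invalid-parameter-default: a parameter annotated `str` with a default of None
# is rejected; widening the annotation to Optional[str] makes the default legal.
def make_runner(session_id: Optional[str] = None) -> str:
    return session_id or "generated-session"


# unresolved-attribute on `GoalState | None`: narrow the union before
# touching the attribute so the checker knows the value is not None.
def failure_count(state: Optional[GoalStateSketch]) -> int:
    assert state is not None
    return state.consecutive_parse_failures
```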

✅ Fixed issues: none

Unchanged: 4028 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

JC and others added 2 commits May 7, 2026 17:07

teknium1 merged commit 307c85e into main May 8, 2026
11 of 12 checks passed

teknium1 deleted the hermes/hermes-1912be3d branch May 8, 2026 00:33

teknium1 mentioned this pull request May 8, 2026

fix(gateway): defer goal status notices until after response delivery #19160

Closed

alt-glitch added the type/bug (Something isn't working), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) labels May 8, 2026

alt-glitch mentioned this pull request May 17, 2026

[Bug]: /goal can spam repeated completion messages when goal_judge errors fail-open to continue #27585

Open

briandevans mentioned this pull request May 18, 2026

fix(goals): pause loop when judge errors and response is terminal (#27585) #27752

Closed

19 tasks

alt-glitch mentioned this pull request May 25, 2026

Bug: /goal ✓ Goal achieved notice silently dropped — generation mismatch in post-delivery callback store #31922

Open
