fix(gateway): avoid stale resume auto-runs by Qwinty · Pull Request #29188 · NousResearch/hermes-agent

Qwinty · 2026-05-20T08:34:44Z

Summary

Follow-up to #28576 / #28217 (the merged salvage of the pre-drain resume fix that duplicated #27831).

This tightens the durable resume_pending recovery path so the gateway does not synthesize stale auto-resume turns after repeated restarts or after a session already finished cleanly.

What changed

Startup auto-resume now checks the transcript tail, not only the session-index marker timestamp.
A resume_pending entry whose transcript already ends with a final assistant response is treated as stale and cleared instead of replayed.
Fresh last_resume_marked_at values no longer revive old transcripts if the replayable transcript tail is outside the freshness window.
During shutdown drain, pre-drain crash-safety markers are cleared for sessions that finish cleanly within the drain window, even if another session causes the drain to time out.
Assistant tool-call tails remain resumable; only final assistant answers with explicit stop/end/completed finish markers are considered complete.
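The tail check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the message shape (`role`, `tool_calls`, `finish_reason` keys) and the exact marker strings are assumptions made for the example.

```python
from typing import Any, Mapping, Sequence

# Assumed marker values; the PR only says "explicit stop/end/completed".
FINAL_FINISH_MARKERS = {"stop", "end", "completed"}


def tail_is_completed_assistant(transcript: Sequence[Mapping[str, Any]]) -> bool:
    """Return True only when the last message is a final assistant answer."""
    if not transcript:
        return False
    tail = transcript[-1]
    if tail.get("role") != "assistant":
        return False
    if tail.get("tool_calls"):
        # An assistant message that requested tools is still in-progress work,
        # so it stays resumable.
        return False
    # Require an explicit finish marker; a missing marker is not "complete".
    return tail.get("finish_reason") in FINAL_FINISH_MARKERS
```

Requiring an explicit marker keeps the check conservative: anything ambiguous remains resumable rather than being cleared.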

Why

last_resume_marked_at can be refreshed by repeated restarts even when no new transcript content was produced. If startup trusts only that marker, it can synthesize an empty recovery turn for old or already-answered context.

The transcript is the safer source of truth: it tells us whether there is still replayable in-progress work.
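To illustrate why the marker alone is unreliable, here is a small sketch of a freshness gate that prefers the transcript tail timestamp. The function name, the 15-minute window, and the fallback to the marker for timestamp-less transcripts are all assumptions for the example, not values taken from the PR.

```python
from typing import Optional

# Assumed window length; the real freshness window is not stated in this PR.
FRESHNESS_WINDOW_SECONDS = 15 * 60


def should_auto_resume(
    marker_ts: float,
    tail_ts: Optional[float],
    now: float,
    window: float = FRESHNESS_WINDOW_SECONDS,
) -> bool:
    """Decide whether a resume_pending entry is fresh enough to replay."""
    # Prefer the transcript tail timestamp. Only fall back to the marker when
    # the transcript carries no timestamp at all (assumed legacy behaviour).
    reference = tail_ts if tail_ts is not None else marker_ts
    return (now - reference) <= window
```

With this gate, a marker refreshed seconds ago cannot revive a transcript whose last replayable entry is hours old.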

Test plan

RED:

Added regression tests first and confirmed they failed before the implementation (_transcript_tail_is_completed_assistant missing / stale resume behavior not present).

GREEN:

python -m pytest tests/gateway/test_restart_resume_pending.py::TestResumePendingSystemNote::test_stale_resume_pending_ignored_when_tail_is_completed_assistant tests/gateway/test_restart_resume_pending.py::TestResumePendingSystemNote::test_assistant_tool_call_tail_is_not_considered_completed tests/gateway/test_restart_resume_pending.py::test_startup_auto_resume_uses_transcript_timestamp_over_fresh_marker tests/gateway/test_restart_resume_pending.py::test_startup_auto_resume_clears_stale_marker_when_tail_already_answered tests/gateway/test_restart_resume_pending.py::test_drain_timeout_clears_pre_mark_for_session_that_finished_during_drain -q -o 'addopts='
python -m pytest tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='
python -m py_compile gateway/run.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
git diff --check
ruff check gateway/run.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py

AI Disclosure

This fix was prepared with AI assistance.

Qwinty · 2026-05-20T10:49:41Z

Follow-up note: this is a pretty painful user-facing gateway bug, not just cleanup.

In Telegram DM topics, a restart could make the gateway synthesize a fresh auto-resume turn in topics where the useful work was already over or where the previous recovery attempt had already failed. The visible symptom is noisy/phantom assistant activity after restart — for example repeated rate-limit messages or the agent continuing in an old topic without new user input.

This PR now hardens the recovery gate in a few ways:

Use durable SQLite transcript timestamps for SessionStore.load_transcript() so DB-backed histories are not treated as legacy timestamp-less/fresh forever.
Clear resume_pending instead of scheduling auto-resume when the transcript already ends in a final assistant answer.
Clear resume_pending when the tail is only a gateway-generated recovery note from a previous empty auto-resume attempt, preventing restart loops.
Treat timed-out / undelivered clarify tool results as a user-wait boundary, not unfinished tool work to auto-process on startup or on the next unrelated user message.
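The hardened gate above amounts to classifying the transcript tail before deciding what to do with resume_pending. A rough sketch, assuming hypothetical field names (`gateway_recovery_note`, `name`, `status`) that are not taken from the actual code:

```python
from enum import Enum
from typing import Any, Mapping, Sequence


class TailAction(Enum):
    RESUME = "resume"  # schedule an auto-resume turn
    CLEAR = "clear"    # clear resume_pending and wait for the user


def classify_tail(transcript: Sequence[Mapping[str, Any]]) -> TailAction:
    """Map the transcript tail to a recovery decision (illustrative only)."""
    if not transcript:
        # Assumption: nothing to replay means nothing to resume.
        return TailAction.CLEAR
    tail = transcript[-1]
    role = tail.get("role")
    if (
        role == "assistant"
        and not tail.get("tool_calls")
        and tail.get("finish_reason") in {"stop", "end", "completed"}
    ):
        return TailAction.CLEAR  # already answered
    if tail.get("gateway_recovery_note"):
        # Hypothetical flag: a note left by a previous empty auto-resume.
        return TailAction.CLEAR  # avoids restart loops
    if (
        role == "tool"
        and tail.get("name") == "clarify"
        and tail.get("status") in {"timeout", "undelivered"}
    ):
        return TailAction.CLEAR  # user-wait boundary, not unfinished tool work
    return TailAction.RESUME
```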

Local verification:

python -m pytest -o addopts='' tests/gateway/test_restart_resume_pending.py -q
python -m pytest -o addopts='' tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_telegram_topic_mode.py -q
python -m py_compile gateway/run.py gateway/session.py hermes_state.py tests/gateway/test_restart_resume_pending.py
ruff check gateway/run.py gateway/session.py hermes_state.py tests/gateway/test_restart_resume_pending.py
git diff --check

Qwinty · 2026-05-20T11:58:11Z

Follow-up high-priority fix for Telegram DM topic status spam:

Root cause: queued/interrupt follow-up recursion kept the outer _run_agent per-turn background helpers alive. Each nested run started its own long-running "Still working..." notifier, so several timers could send to the same topic seconds apart with different elapsed clocks.
Fix: add per-session notifier ownership tokens and cancel the outer run's auxiliary tasks before recursing into a queued follow-up.
Regression: tests/gateway/test_long_running_notifications.py verifies stale outer notifiers do not send after a queued follow-up starts.
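The ownership-token idea can be sketched with plain asyncio. The class and function names below are hypothetical and the timings are shortened for demonstration; the real notifier in gateway/run.py may be structured differently.

```python
import asyncio
from typing import Callable, Dict, List, Tuple


class NotifierOwnership:
    """Tracks which run currently owns the status notifier for a session."""

    def __init__(self) -> None:
        self._owners: Dict[str, object] = {}

    def claim(self, session_key: str) -> object:
        token = object()  # unique per run
        self._owners[session_key] = token
        return token

    def owns(self, session_key: str, token: object) -> bool:
        return self._owners.get(session_key) is token


async def still_working_notifier(
    ownership: NotifierOwnership,
    session_key: str,
    token: object,
    send: Callable[[str], None],
    interval: float,
    ticks: int,
) -> None:
    for _ in range(ticks):
        await asyncio.sleep(interval)
        if not ownership.owns(session_key, token):
            return  # a newer run took over; stay silent
        send("Still working...")


async def demo() -> List[Tuple[str, str]]:
    sent: List[Tuple[str, str]] = []
    ownership = NotifierOwnership()
    outer_token = ownership.claim("topic-1")
    outer = asyncio.create_task(still_working_notifier(
        ownership, "topic-1", outer_token,
        lambda m: sent.append(("outer", m)), 0.01, 3))
    # Queued follow-up: cancel the outer run's auxiliary task and take
    # ownership before recursing into the nested run.
    outer.cancel()
    inner_token = ownership.claim("topic-1")
    inner = asyncio.create_task(still_working_notifier(
        ownership, "topic-1", inner_token,
        lambda m: sent.append(("inner", m)), 0.01, 2))
    await asyncio.gather(outer, inner, return_exceptions=True)
    return sent
```

Cancellation and the token check are deliberately redundant here: even if a stale task escapes cancellation, it finds it no longer owns the session and stops sending.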

Local verification:

python -m pytest -q -o 'addopts=' tests/gateway/test_long_running_notifications.py tests/gateway/test_run_cleanup_progress.py tests/gateway/test_queue_consumption.py tests/gateway/test_restart_resume_pending.py
python -m py_compile gateway/run.py tests/gateway/test_long_running_notifications.py
git diff --check
python -m ruff check gateway/run.py tests/gateway/test_long_running_notifications.py

Qwinty · 2026-05-20T15:34:10Z

Follow-up fix pushed: eedb4af fix(gateway): suppress killed process completion turns.

The root cause in the new repro was not startup resume_pending. The agent sent the requested final answer after killing the process, but the process watcher still injected [IMPORTANT: Background process ... completed] for the killed notify_on_complete process, creating a fresh internal turn and a second visible assistant reply.

Fix:

Treat process(action=kill) / already-exited kill as completion consumption, same as wait/poll/log.
For notify_on_complete watcher completions, skip all delivery when the process completion is already consumed, including the plain text notification fallback.
Added regressions in tests/tools/test_notify_on_complete.py and tests/gateway/test_background_process_notifications.py for the killed notify process not re-entering the session.
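A simplified model of the consumption rule follows. The registry class and its methods are hypothetical stand-ins, not the API of tools/process_registry.py, and the sketch marks consumption unconditionally for every listed action, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List

# Actions that count as the agent having already observed the outcome.
# "kill" is the addition described in this fix.
CONSUMING_ACTIONS = {"wait", "poll", "log", "kill"}


@dataclass
class ProcessEntry:
    notify_on_complete: bool = False
    completion_consumed: bool = False


class ProcessRegistry:
    """Hypothetical registry used only to illustrate the rule."""

    def __init__(self) -> None:
        self.entries: Dict[str, ProcessEntry] = {}
        self.delivered: List[str] = []

    def start(self, pid: str, notify_on_complete: bool) -> None:
        self.entries[pid] = ProcessEntry(notify_on_complete=notify_on_complete)

    def handle_action(self, pid: str, action: str) -> None:
        if action in CONSUMING_ACTIONS:
            self.entries[pid].completion_consumed = True

    def on_exit(self, pid: str) -> bool:
        """Watcher hook: deliver a completion notice at most once."""
        entry = self.entries[pid]
        if not entry.notify_on_complete or entry.completion_consumed:
            return False  # skip all delivery, including any text fallback
        entry.completion_consumed = True
        self.delivered.append(pid)
        return True
```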

Verification run locally:

uv run --extra dev pytest -o addopts='' tests/tools/test_notify_on_complete.py tests/gateway/test_background_process_notifications.py tests/gateway/test_internal_event_bypass_pairing.py tests/gateway/test_duplicate_reply_suppression.py tests/gateway/test_long_running_notifications.py -q -> 87 passed
uv run --extra dev --with ptyprocess pytest -o addopts='' tests/tools/test_process_registry.py -q -> 58 passed
python3 -m pytest -o addopts='' tests/gateway/test_restart_resume_pending.py tests/gateway/test_telegram_topic_mode.py -q -> 116 passed
uv run --extra dev ruff check gateway/run.py tools/process_registry.py tests/tools/test_notify_on_complete.py tests/gateway/test_background_process_notifications.py -> passed
python3 -m py_compile ... and git diff --check -> passed

Qwinty · 2026-05-20T21:27:19Z

Follow-up fix pushed in 3939180: fix(gateway): drop stale watch-pattern continuations.

New root cause from session 20260519_180814_588b4293: this was not another startup resume_pending loop. At 2026-05-21 00:02 the completed turn drained queued process watch_pattern matches from the Gold Apple login probe, formatted them as gateway process notifications, and re-injected them through adapter.handle_message() into the same Telegram DM Topic. Because the original final answer had already been delivered, that synthetic internal message created a fresh agent turn in the finished topic.

Fix:

propagate per-session run_generation through gateway.session_context into terminal background process sessions and watch events;
drop watch notifications whose generation is stale for their session_key;
drop same-turn watch notifications during the completing run instead of re-entering the session;
keep existing cross-session/current watch notification behavior intact.
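The generation check in the list above might look roughly like this. The event type, the function name, and the `completing_session_key` parameter are illustrative assumptions; only `run_generation` and `session_key` come from the PR text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class WatchEvent:
    session_key: str
    run_generation: int
    text: str


def should_deliver_watch_event(
    event: WatchEvent,
    current_generations: Dict[str, int],
    completing_session_key: Optional[str] = None,
) -> bool:
    """Return False for watch notifications that must not re-enter a session."""
    current = current_generations.get(event.session_key)
    if current is not None and event.run_generation < current:
        return False  # produced by an older run of this session: stale
    if completing_session_key is not None and event.session_key == completing_session_key:
        return False  # same-turn match drained while the run is completing
    # Cross-session and current-generation events keep their existing path.
    return True
```

Sessions with no recorded generation are delivered in this sketch, on the assumption that existing behaviour should be preserved when there is nothing to compare against.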

Regression coverage added:

same-turn watch_match does not call adapter.handle_message();
stale-generation watch_match is dropped;
watch matches carry run_generation from the process session;
session context exposes HERMES_SESSION_RUN_GENERATION.

Verification:

HERMES_HOME=/tmp/tmp.OS8JucTGYC python3 -m pytest -o addopts='' tests/tools/test_watch_patterns.py tests/gateway/test_background_process_notifications.py tests/gateway/test_duplicate_reply_suppression.py tests/gateway/test_long_running_notifications.py tests/gateway/test_internal_event_bypass_pairing.py tests/tools/test_notify_on_complete.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_telegram_topic_mode.py tests/gateway/test_session_env.py tests/gateway/test_session_hygiene.py tests/gateway/test_slash_access_dispatch.py tests/gateway/test_approve_deny_commands.py tests/gateway/test_reload_skills_command.py tests/gateway/test_running_agent_session_toggles.py tests/gateway/test_unknown_command.py tests/gateway/test_steer_command.py tests/gateway/test_status_command.py -q -> 348 passed, 1 warning
uv run --extra dev pytest -o addopts='' tests/gateway/test_background_process_notifications.py::test_same_turn_watch_match_does_not_reenter_session tests/gateway/test_background_process_notifications.py::test_stale_watch_match_generation_is_dropped tests/tools/test_watch_patterns.py::TestCheckWatchPatterns::test_match_carries_run_generation tests/gateway/test_session_env.py::test_set_session_env_sets_contextvars -q -> 4 passed
uv run --extra dev ruff check ... -> passed
python3 -m py_compile ... -> passed
git diff --check -> passed

Qwinty · 2026-05-21T20:51:54Z

Superseded by the narrower restart-resume recovery PR: #30030. This older branch accumulated unrelated gateway/runtime/test changes, so #30030 keeps the review scope to restart resume freshness, Telegram reply anchors, and failed-turn goal continuation gating.

fix(gateway): avoid stale resume auto-runs

d60b67e

This was referenced May 20, 2026

fix(gateway): premark active sessions before drain #27831

Closed

Gateway restart can lose long-running sessions during shutdown drain #27856

Closed

alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels May 20, 2026

Qwinty added 3 commits May 20, 2026 11:55

test(plugins): cover xai web provider

a7163b7

test(cli): keep update hangup helpers reload-safe

69da826

fix(gateway): stop recovery-note restart loops

d80c81f

fix(gateway): dedupe long-running status timers

8ec815a

fix(gateway): suppress killed process completion turns

eedb4af

Qwinty added 2 commits May 20, 2026 18:54

test(cli): tolerate update probe during autostash test

89c471b

fix(gateway): drop stale watch-pattern continuations

3939180

alt-glitch mentioned this pull request May 21, 2026

fix(gateway): skip shutdown-timeout startup auto-resume #29728

Open

Qwinty mentioned this pull request May 21, 2026

fix(gateway): harden restart resume recovery #30030

Open

Qwinty closed this May 21, 2026

Qwinty wants to merge 8 commits into NousResearch:main from Qwinty:fix/resume-pending-freshness
