fix(gateway): avoid stale resume auto-runs #29188

Closed
Qwinty wants to merge 8 commits into NousResearch:main from Qwinty:fix/resume-pending-freshness

Conversation

Qwinty commented May 20, 2026

Contributor

Summary

Follow-up to #28576 / #28217 (the merged salvage of the pre-drain resume fix that duplicated #27831).

This tightens the durable resume_pending recovery path so the gateway does not synthesize stale auto-resume turns after repeated restarts or after a session already finished cleanly.

What changed

  • Startup auto-resume now checks the transcript tail, not only the session-index marker timestamp.
  • A resume_pending entry whose transcript already ends with a final assistant response is treated as stale and cleared instead of replayed.
  • Fresh last_resume_marked_at values no longer revive old transcripts if the replayable transcript tail is outside the freshness window.
  • During shutdown drain, pre-drain crash-safety markers are cleared for sessions that finish cleanly during the drain window, even if another session keeps the drain timed out.
  • Assistant tool-call tails remain resumable; only final assistant answers with explicit stop/end/completed finish markers are considered complete.
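
The completion rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual `gateway/run.py` code: the message shape (dicts with `role`, `tool_calls`, `finish_reason`), the function name, and the exact marker strings are assumptions.

```python
from typing import Any, Mapping, Sequence

# Assumed spellings of the "stop/end/completed" finish markers (hypothetical).
_FINAL_FINISH_MARKERS = frozenset({"stop", "end", "completed"})


def tail_is_completed_assistant(transcript: Sequence[Mapping[str, Any]]) -> bool:
    """Return True only when the last message is a final assistant answer."""
    if not transcript:
        return False
    tail = transcript[-1]
    if tail.get("role") != "assistant":
        return False
    # An assistant message that still requests tool calls is unfinished work,
    # so it stays resumable regardless of its finish marker.
    if tail.get("tool_calls"):
        return False
    # Require an explicit finish marker; a missing marker stays resumable.
    return tail.get("finish_reason") in _FINAL_FINISH_MARKERS
```

Requiring an explicit marker errs on the side of resuming: an assistant message cut off mid-generation has no marker and is therefore still treated as in-progress work.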

Why

last_resume_marked_at can be refreshed by repeated restarts even when no new transcript content was produced. If startup trusts only that marker, it can synthesize an empty recovery turn for old or already-answered context.

The transcript is the safer source of truth: it tells us whether there is still replayable in-progress work.
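
To illustrate why the transcript timestamp is the safer input, here is a small sketch of a freshness gate. The function name, the 30-minute window, and the fallback to the marker for timestamp-less transcripts are all assumptions made for the example, not values taken from the gateway.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical freshness window; the real value is not stated in this PR.
FRESHNESS_WINDOW = timedelta(minutes=30)


def should_auto_resume(
    marked_at: datetime,
    tail_timestamp: Optional[datetime],
    now: datetime,
    window: timedelta = FRESHNESS_WINDOW,
) -> bool:
    """Decide freshness from the transcript tail, not the restart marker."""
    # The marker can be refreshed by restarts alone, so prefer the time of the
    # last replayable transcript entry. Falling back to the marker when the
    # transcript carries no timestamp is an assumption of this sketch.
    reference = tail_timestamp if tail_timestamp is not None else marked_at
    return now - reference <= window
```

With this shape, a marker refreshed one minute ago cannot revive a transcript whose last entry is six hours old.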

Test plan

RED:

  • Added regression tests first and confirmed they failed before the implementation (_transcript_tail_is_completed_assistant missing / stale resume behavior not present).

GREEN:

  • python -m pytest tests/gateway/test_restart_resume_pending.py::TestResumePendingSystemNote::test_stale_resume_pending_ignored_when_tail_is_completed_assistant tests/gateway/test_restart_resume_pending.py::TestResumePendingSystemNote::test_assistant_tool_call_tail_is_not_considered_completed tests/gateway/test_restart_resume_pending.py::test_startup_auto_resume_uses_transcript_timestamp_over_fresh_marker tests/gateway/test_restart_resume_pending.py::test_startup_auto_resume_clears_stale_marker_when_tail_already_answered tests/gateway/test_restart_resume_pending.py::test_drain_timeout_clears_pre_mark_for_session_that_finished_during_drain -q -o 'addopts='
  • python -m pytest tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='
  • python -m py_compile gateway/run.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
  • git diff --check
  • ruff check gateway/run.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py

AI Disclosure

This fix was prepared with AI assistance.

alt-glitch added labels on May 20, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery)
Qwinty commented May 20, 2026

Contributor Author

Follow-up note: this is a pretty painful user-facing gateway bug, not just cleanup.

In Telegram DM topics, a restart could make the gateway synthesize a fresh auto-resume turn in topics where the useful work was already over or where the previous recovery attempt had already failed. The visible symptom is noisy/phantom assistant activity after restart — for example repeated rate-limit messages or the agent continuing in an old topic without new user input.

This PR now hardens the recovery gate in a few ways:

  • Use durable SQLite transcript timestamps for SessionStore.load_transcript() so DB-backed histories are not treated as legacy timestamp-less/fresh forever.
  • Clear resume_pending instead of scheduling auto-resume when the transcript already ends in a final assistant answer.
  • Clear resume_pending when the tail is only a gateway-generated recovery note from a previous empty auto-resume attempt, preventing restart loops.
  • Treat timed-out / undelivered clarify tool results as a user-wait boundary, not unfinished tool work to auto-process on startup or on the next unrelated user message.
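
The two extra tail cases above (a leftover recovery note and an unanswered clarify result) might be detected along these lines. The `kind`, `name`, and `status` fields and their values are hypothetical placeholders; the real transcript schema is not shown in this PR.

```python
from typing import Any, Mapping, Sequence

# Hypothetical status values for a clarify result that never reached the user.
_CLARIFY_WAIT_STATUSES = frozenset({"timeout", "undelivered"})


def tail_blocks_auto_resume(transcript: Sequence[Mapping[str, Any]]) -> bool:
    """Return True when the tail means: clear resume_pending, do not replay."""
    if not transcript:
        return False
    tail = transcript[-1]
    # A recovery note written by an earlier empty auto-resume attempt.
    # Replaying it would only produce another note, i.e. a restart loop.
    if tail.get("kind") == "gateway_recovery_note":
        return True
    # A clarify question the user never answered is a user-wait boundary,
    # not unfinished tool work.
    if tail.get("role") == "tool" and tail.get("name") == "clarify":
        return tail.get("status") in _CLARIFY_WAIT_STATUSES
    return False
```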

Local verification:

  • python -m pytest -o addopts='' tests/gateway/test_restart_resume_pending.py -q
  • python -m pytest -o addopts='' tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_telegram_topic_mode.py -q
  • python -m py_compile gateway/run.py gateway/session.py hermes_state.py tests/gateway/test_restart_resume_pending.py
  • ruff check gateway/run.py gateway/session.py hermes_state.py tests/gateway/test_restart_resume_pending.py
  • git diff --check

Qwinty commented May 20, 2026

Contributor Author

Follow-up high-priority fix for Telegram DM topic status spam:

  • Root cause: queued/interrupt follow-up recursion kept the outer _run_agent per-turn background helpers alive. Each nested run started its own long-running "Still working..." notifier, so several timers could send to the same topic seconds apart with different elapsed clocks.
  • Fix: add per-session notifier ownership tokens and cancel the outer run's auxiliary tasks before recursing into a queued follow-up.
  • Regression: tests/gateway/test_long_running_notifications.py verifies stale outer notifiers do not send after a queued follow-up starts.
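
A rough sketch of the ownership-token idea, using plain `asyncio`. The class and function names are invented for illustration and do not mirror `gateway/run.py`; the sketch only shows how a stale notifier can be silenced both by cancellation and by a token check.

```python
import asyncio
from typing import Dict, List


class NotifierOwnership:
    """Tracks which run currently owns the status notifier for a session."""

    def __init__(self) -> None:
        self._owner: Dict[str, object] = {}

    def claim(self, session_key: str) -> object:
        token = object()  # unique identity per run
        self._owner[session_key] = token
        return token

    def is_owner(self, session_key: str, token: object) -> bool:
        return self._owner.get(session_key) is token


async def still_working_notifier(ownership, session_key, token, sent, interval):
    while True:
        await asyncio.sleep(interval)
        if not ownership.is_owner(session_key, token):
            return  # a newer run took over; stay silent
        sent.append(token)  # stand-in for sending "Still working..."


async def demo() -> List[bool]:
    ownership = NotifierOwnership()
    sent: List[object] = []
    outer_token = ownership.claim("topic-1")
    outer = asyncio.create_task(
        still_working_notifier(ownership, "topic-1", outer_token, sent, 0.01)
    )
    # Before recursing into the queued follow-up: cancel the outer helper
    # and let the nested run claim a fresh token.
    outer.cancel()
    inner_token = ownership.claim("topic-1")
    inner = asyncio.create_task(
        still_working_notifier(ownership, "topic-1", inner_token, sent, 0.01)
    )
    await asyncio.sleep(0.05)
    inner.cancel()
    await asyncio.gather(outer, inner, return_exceptions=True)
    # True for every message that came from the nested run's notifier.
    return [t is inner_token for t in sent]
```

The token check is a second line of defense: even if an outer task escapes cancellation, it stops sending once it no longer owns the session.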

Local verification:

  • python -m pytest -q -o 'addopts=' tests/gateway/test_long_running_notifications.py tests/gateway/test_run_cleanup_progress.py tests/gateway/test_queue_consumption.py tests/gateway/test_restart_resume_pending.py
  • python -m py_compile gateway/run.py tests/gateway/test_long_running_notifications.py
  • git diff --check
  • python -m ruff check gateway/run.py tests/gateway/test_long_running_notifications.py

Qwinty commented May 20, 2026

Contributor Author

Follow-up fix pushed: eedb4af fix(gateway): suppress killed process completion turns.

The root cause in the new repro was not startup resume_pending. The agent sent the requested final answer after killing the process, but the process watcher still injected [IMPORTANT: Background process ... completed] for the killed notify_on_complete process, which created a fresh internal turn and a second visible assistant reply.

Fix:

  • Treat process(action=kill) / already-exited kill as completion consumption, same as wait/poll/log.
  • For notify_on_complete watcher completions, skip all delivery when the process completion is already consumed, including the plain text notification fallback.
  • Added regressions in tests/tools/test_notify_on_complete.py and tests/gateway/test_background_process_notifications.py for the killed notify process not re-entering the session.
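
The consumption rule can be pictured with a small tracker like the one below. This is a simplified sketch: `CompletionTracker` and `deliver_completion` are hypothetical names, and the real `tools/process_registry.py` logic presumably has more conditions than "any of these actions consumes the completion".

```python
from typing import Callable, Set


class CompletionTracker:
    """Remembers which process completions the agent has already observed."""

    # kill now counts as consumption, alongside wait/poll/log.
    CONSUMING_ACTIONS = frozenset({"wait", "poll", "log", "kill"})

    def __init__(self) -> None:
        self._consumed: Set[str] = set()

    def record_action(self, process_id: str, action: str) -> None:
        if action in self.CONSUMING_ACTIONS:
            self._consumed.add(process_id)

    def is_consumed(self, process_id: str) -> bool:
        return process_id in self._consumed


def deliver_completion(
    tracker: CompletionTracker, process_id: str, notify: Callable[[str], None]
) -> bool:
    """Notify only if the completion was not already consumed."""
    if tracker.is_consumed(process_id):
        # Skip every delivery path, including any plain-text fallback.
        return False
    notify(f"Background process {process_id} completed")
    return True
```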

Verification run locally:

  • uv run --extra dev pytest -o addopts='' tests/tools/test_notify_on_complete.py tests/gateway/test_background_process_notifications.py tests/gateway/test_internal_event_bypass_pairing.py tests/gateway/test_duplicate_reply_suppression.py tests/gateway/test_long_running_notifications.py -q -> 87 passed
  • uv run --extra dev --with ptyprocess pytest -o addopts='' tests/tools/test_process_registry.py -q -> 58 passed
  • python3 -m pytest -o addopts='' tests/gateway/test_restart_resume_pending.py tests/gateway/test_telegram_topic_mode.py -q -> 116 passed
  • uv run --extra dev ruff check gateway/run.py tools/process_registry.py tests/tools/test_notify_on_complete.py tests/gateway/test_background_process_notifications.py -> passed
  • python3 -m py_compile ... and git diff --check -> passed

Qwinty commented May 20, 2026

Contributor Author

Follow-up fix pushed in 3939180: fix(gateway): drop stale watch-pattern continuations.

New root cause from session 20260519_180814_588b4293: this was not another startup resume_pending loop. At 2026-05-21 00:02 the completed turn drained queued process watch_pattern matches from the Gold Apple login probe, formatted them as gateway process notifications, and re-injected them through adapter.handle_message() into the same Telegram DM Topic. Because the original final answer had already been delivered, that synthetic internal message created a fresh agent turn in the finished topic.

Fix:

  • propagate per-session run_generation through gateway.session_context into terminal background process sessions and watch events;
  • drop watch notifications whose generation is stale for their session_key;
  • drop same-turn watch notifications during the completing run instead of re-entering the session;
  • keep existing cross-session/current watch notification behavior intact.
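
As an illustration of the generation check described in the list above, the following sketch drops stale and same-turn watch events. `WatchEvent` and `GenerationGate` are hypothetical names, and delivering events that carry no generation is an assumption of this sketch rather than documented gateway behavior.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class WatchEvent:
    session_key: str
    run_generation: Optional[int]  # generation of the run that started the process
    text: str


class GenerationGate:
    """Tracks the current run generation per session and filters watch events."""

    def __init__(self) -> None:
        self._current: Dict[str, int] = {}

    def start_run(self, session_key: str) -> int:
        generation = self._current.get(session_key, 0) + 1
        self._current[session_key] = generation
        return generation

    def should_deliver(
        self, event: WatchEvent, completing_session_key: Optional[str] = None
    ) -> bool:
        # Same-turn events for the run that is completing are dropped
        # instead of re-entering the session.
        if event.session_key == completing_session_key:
            return False
        current = self._current.get(event.session_key)
        # Events stamped with an older generation belong to a finished run.
        if (
            event.run_generation is not None
            and current is not None
            and event.run_generation != current
        ):
            return False
        # Cross-session and current-generation events pass through unchanged.
        return True
```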

Regression coverage added:

  • same-turn watch_match does not call adapter.handle_message();
  • stale-generation watch_match is dropped;
  • watch matches carry run_generation from the process session;
  • session context exposes HERMES_SESSION_RUN_GENERATION.

Verification:

  • HERMES_HOME=/tmp/tmp.OS8JucTGYC python3 -m pytest -o addopts='' tests/tools/test_watch_patterns.py tests/gateway/test_background_process_notifications.py tests/gateway/test_duplicate_reply_suppression.py tests/gateway/test_long_running_notifications.py tests/gateway/test_internal_event_bypass_pairing.py tests/tools/test_notify_on_complete.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_telegram_topic_mode.py tests/gateway/test_session_env.py tests/gateway/test_session_hygiene.py tests/gateway/test_slash_access_dispatch.py tests/gateway/test_approve_deny_commands.py tests/gateway/test_reload_skills_command.py tests/gateway/test_running_agent_session_toggles.py tests/gateway/test_unknown_command.py tests/gateway/test_steer_command.py tests/gateway/test_status_command.py -q -> 348 passed, 1 warning
  • uv run --extra dev pytest -o addopts='' tests/gateway/test_background_process_notifications.py::test_same_turn_watch_match_does_not_reenter_session tests/gateway/test_background_process_notifications.py::test_stale_watch_match_generation_is_dropped tests/tools/test_watch_patterns.py::TestCheckWatchPatterns::test_match_carries_run_generation tests/gateway/test_session_env.py::test_set_session_env_sets_contextvars -q -> 4 passed
  • uv run --extra dev ruff check ... -> passed
  • python3 -m py_compile ... -> passed
  • git diff --check -> passed

Qwinty commented May 21, 2026

Contributor Author

Superseded by the narrower restart-resume recovery PR: #30030. This older branch accumulated unrelated gateway/runtime/test changes, so #30030 keeps the review scope to restart resume freshness, Telegram reply anchors, and failed-turn goal continuation gating.

Qwinty closed this May 21, 2026


2 participants