fix(gateway): avoid stale resume auto-runs #29188
Follow-up note: this is a pretty painful user-facing gateway bug, not just cleanup. In Telegram DM topics, a restart could make the gateway synthesize a fresh auto-resume turn in topics where the useful work was already over or where the previous recovery attempt had already failed. The visible symptom is noisy/phantom assistant activity after restart — for example repeated rate-limit messages or the agent continuing in an old topic without new user input. This PR now hardens the recovery gate in a few ways:
Local verification:
Follow-up high-priority fix for Telegram DM topic status spam:
Local verification:
Follow-up fix pushed: eedb4af
Root cause from the new repro was not startup
Fix:
Verification run locally:
Follow-up fix pushed in 3939180:
New root cause from session
Fix:
Regression coverage added:
Verification:
Summary
Follow-up to #28576 / #28217 (the merged salvage of the pre-drain resume fix that duplicated #27831).
This tightens the durable `resume_pending` recovery path so the gateway does not synthesize stale auto-resume turns after repeated restarts or after a session already finished cleanly.
What changed
- A `resume_pending` entry whose transcript already ends with a final assistant response is treated as stale and cleared instead of replayed.
- `last_resume_marked_at` values no longer revive old transcripts if the replayable transcript tail is outside the freshness window.
Why
`last_resume_marked_at` can be refreshed by repeated restarts even when no new transcript content was produced. If startup trusts only that marker, it can synthesize an empty recovery turn for old or already-answered context.
The transcript is the safer source of truth: it tells us whether there is still replayable in-progress work.
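The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the code in `gateway/run.py`: the message shape (dicts with `role`, `content`, `tool_calls`, and `timestamp` keys), the `FRESHNESS_WINDOW` value, and the `should_auto_resume` helper are all assumptions made for the example. Only the idea behind `_transcript_tail_is_completed_assistant` comes from the PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Any

# Hypothetical freshness window; the real value is not stated in this PR.
FRESHNESS_WINDOW = timedelta(minutes=30)


def transcript_tail_is_completed_assistant(transcript: list[dict[str, Any]]) -> bool:
    """Return True when the last message looks like a final assistant reply.

    Assumed message shape: {"role", "content", "tool_calls", "timestamp"}.
    """
    if not transcript:
        return False
    tail = transcript[-1]
    if tail.get("role") != "assistant":
        return False
    # An assistant message that still requests tool calls is in-progress work,
    # so it must not be treated as a completed answer.
    if tail.get("tool_calls"):
        return False
    content = tail.get("content")
    return isinstance(content, str) and bool(content.strip())


def should_auto_resume(transcript: list[dict[str, Any]], now: datetime) -> bool:
    """Decide whether a resume_pending entry is still worth replaying.

    The marker timestamp (last_resume_marked_at) is deliberately not an input:
    freshness is judged from the transcript tail alone.
    """
    if not transcript:
        return False  # nothing to replay
    if transcript_tail_is_completed_assistant(transcript):
        return False  # already answered: the entry is stale
    tail_time = transcript[-1].get("timestamp")
    if tail_time is None:
        return False  # assumption: an undated tail is treated as not fresh
    return now - tail_time <= FRESHNESS_WINDOW
```

In this sketch, a restart that only refreshes the marker cannot make an old transcript look fresh, because the decision never reads the marker. A caller would clear the `resume_pending` entry whenever `should_auto_resume` returns `False`.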
Test plan
RED:
`_transcript_tail_is_completed_assistant` missing / stale resume behavior not present.
GREEN:
- `python -m pytest tests/gateway/test_restart_resume_pending.py::TestResumePendingSystemNote::test_stale_resume_pending_ignored_when_tail_is_completed_assistant tests/gateway/test_restart_resume_pending.py::TestResumePendingSystemNote::test_assistant_tool_call_tail_is_not_considered_completed tests/gateway/test_restart_resume_pending.py::test_startup_auto_resume_uses_transcript_timestamp_over_fresh_marker tests/gateway/test_restart_resume_pending.py::test_startup_auto_resume_clears_stale_marker_when_tail_already_answered tests/gateway/test_restart_resume_pending.py::test_drain_timeout_clears_pre_mark_for_session_that_finished_during_drain -q -o 'addopts='`
- `python -m pytest tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='`
- `python -m py_compile gateway/run.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py`
- `git diff --check`
- `ruff check gateway/run.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py`
AI Disclosure
This fix was prepared with AI assistance.