Bug Description
Long-running gateway sessions can be lost across a service restart if the process is killed while it is still waiting for active agents to drain.
The current restart/shutdown flow writes the durable resume_pending marker only after _drain_active_agents(timeout) returns timed_out=True. Service managers such as systemd can terminate the process while it is still inside that drain wait, before the timeout branch runs.
If the session's updated_at is older than the startup suspend_recently_active() fallback window, the next gateway boot has no durable marker to resume or suspend the in-flight session cleanly.
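To make the gap concrete, here is a minimal sketch of the startup decision as described above. It is not the gateway's actual code: the function name startup_action, its parameters, the outcome labels, and the inclusive window comparison are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def startup_action(updated_at, resume_pending, now, window):
    """Hypothetical model of what the next boot can do with one session.

    updated_at     -- last time the session entry was updated
    resume_pending -- whether the durable marker was written before exit
    now, window    -- boot time and the startup fallback window (assumed)
    """
    if resume_pending:
        # A durable marker exists, so the session can be recovered.
        return "recover"
    if now - updated_at <= window:
        # No marker, but recent enough for the freshness fallback.
        return "suspend_fallback"
    # No marker and stale updated_at: indistinguishable from an idle session.
    return "treated_as_idle"


boot = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = timedelta(minutes=2)
long_turn_started = boot - timedelta(minutes=30)

# Killed during the drain wait: no marker, stale timestamp.
print(startup_action(long_turn_started, False, boot, window))
```

In this model, the bug is the third branch: a session that was actively running falls through both checks because its timestamp only advances when a turn completes.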
Steps to Reproduce
- Start a gateway session from a messaging platform.
- Run a long task whose SessionEntry.updated_at is older than the startup fallback window because the turn has not completed yet.
- Restart the gateway via the service manager while the task is still active.
- Let the service manager terminate the old process while it is still inside the drain wait.
- Start the gateway again and send a message in the same chat/thread/topic.
Expected Behavior
The active session is durably marked before the vulnerable drain wait begins, so the next process can recover the interrupted session state instead of treating it as a normal idle session.
Actual Behavior
The durable resume_pending marker may never be written if the old process is killed during the drain wait. Long-running sessions outside the startup freshness window can then appear stopped or reset after restart.
Notes
This is in the same failure-mode family as the existing restart resume work, but the race happens earlier than the post-timeout mark_resume_pending() path: the process can die before that branch gets control.
A narrow fix is to pre-mark currently running sessions before awaiting drain, then clear only those early markers if the drain completes gracefully.
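The pre-mark approach could look roughly like the sketch below. It assumes a simplified in-memory store and a drain callable that returns a timed_out boolean; SessionStore, clear_resume_pending, and drain_with_premark are hypothetical names, and the real store and _drain_active_agents signature may differ.

```python
import asyncio


class SessionStore:
    """Hypothetical stand-in for the durable session store."""

    def __init__(self):
        self.resume_pending = set()

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)

    def clear_resume_pending(self, session_id):
        self.resume_pending.discard(session_id)


async def drain_with_premark(store, running_ids, drain, timeout):
    """Mark running sessions before the drain wait, then undo on success.

    `drain` is assumed to be an async callable returning timed_out (bool).
    """
    # Remember only the markers written here, so markers that existed
    # before this shutdown are never cleared by the graceful path.
    early = [sid for sid in running_ids if sid not in store.resume_pending]
    for sid in early:
        store.mark_resume_pending(sid)

    # If the process is killed inside this await, the markers above
    # are already durable and the next boot can see them.
    timed_out = await drain(timeout)

    if not timed_out:
        # Graceful drain: the sessions finished, so remove the early markers.
        for sid in early:
            store.clear_resume_pending(sid)
    return timed_out
```

Tracking the early markers separately is what keeps the fix narrow: a marker left over from an earlier interrupted restart survives a later graceful drain.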