Gateway restart can lose long-running sessions during shutdown drain #27856

@Qwinty

Bug Description

Long-running gateway sessions can be lost across a service restart if the process is killed while it is still waiting for active agents to drain.

The current restart/shutdown flow only writes the durable resume_pending marker after _drain_active_agents(timeout) returns timed_out=True. Service managers such as systemd can terminate the process while it is still inside that drain wait, before the timeout branch runs.

If the session's updated_at is older than the startup suspend_recently_active() fallback window, the next gateway boot has no durable marker to resume or suspend the in-flight session cleanly.
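To make the gap concrete, here is a minimal sketch of the startup decision as described above. It is an illustrative model, not the gateway's actual code: `recoverable_at_startup`, the `SessionEntry` fields, and the window value are hypothetical, and only `updated_at` and `resume_pending` come from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SessionEntry:
    # Field names mirror the issue text; the real class may differ.
    session_id: str
    updated_at: datetime
    resume_pending: bool = False


def recoverable_at_startup(entry, now, fallback_window):
    """Return True if the next boot has any signal that the session was in flight."""
    if entry.resume_pending:
        # The durable marker was written before the old process died.
        return True
    # Fallback: only sessions touched within the window are treated as interrupted.
    return now - entry.updated_at <= fallback_window


# Hypothetical values: a 5-minute window and a turn that started 30 minutes ago.
now = datetime(2024, 1, 1, 12, 0, 0)
stale = SessionEntry("long-task", now - timedelta(minutes=30))
lost = not recoverable_at_startup(stale, now, timedelta(minutes=5))
```

With no marker and a stale `updated_at`, both checks fail, which is the case this issue describes.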

Steps to Reproduce

  1. Start a gateway session from a messaging platform.
  2. Run a task long enough that SessionEntry.updated_at falls outside the startup fallback window (the turn has not completed yet, so updated_at is stale).
  3. Restart the gateway via the service manager while the task is still active.
  4. Let the service manager terminate the old process while it is still inside the drain wait.
  5. Start the gateway again and send a message in the same chat/thread/topic.

Expected Behavior

The active session is durably marked before the vulnerable drain wait begins, so the next process can recover the interrupted session state instead of treating it as a normal idle session.

Actual Behavior

The durable resume_pending marker is never written if the old process is killed during the drain wait. Long-running sessions outside the startup fallback window can then appear stopped or reset after restart.

Notes

This is in the same failure-mode family as the existing restart resume work, but the race happens earlier than the post-timeout mark_resume_pending() path: the process can die before that branch gets control.

A narrow fix is to pre-mark currently running sessions before awaiting drain, then clear only those early markers if the drain completes gracefully.
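A rough sketch of that ordering, under stated assumptions: `SessionStore`, `shutdown_with_premark`, and the `drain` callable are hypothetical stand-ins, and the only behavior taken from this report is that the drain step reports `timed_out` and that markers are set via something like `mark_resume_pending()`.

```python
import asyncio


class SessionStore:
    """Hypothetical stand-in for the gateway's durable session store."""

    def __init__(self):
        self.resume_pending = set()

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)

    def clear_resume_pending(self, session_id):
        self.resume_pending.discard(session_id)


async def shutdown_with_premark(store, active_ids, drain, timeout):
    """Pre-mark running sessions, await drain, then clear only the early marks."""
    # Track only the markers this call adds, so markers written by other
    # code paths are never cleared here.
    early = [sid for sid in active_ids if sid not in store.resume_pending]
    for sid in early:
        store.mark_resume_pending(sid)

    # Vulnerable window: if the process is killed during this await,
    # the markers written above already exist for the next boot.
    timed_out = await drain(timeout)

    if not timed_out:
        # Graceful drain: the sessions finished, so undo the early marks.
        for sid in early:
            store.clear_resume_pending(sid)
    return timed_out
```

On a timeout the early markers are simply left in place, so the post-timeout path has nothing extra to write for those sessions.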

Labels

P1 (High: major feature broken, no workaround)
comp/gateway (Gateway runner, session dispatch, delivery)
type/bug (Something isn't working)
