Gateway restart can lose long-running sessions during shutdown drain #27856

## Bug Description

Long-running gateway sessions can be lost across a service restart if the process is killed while it is still waiting for active agents to drain.

The current restart/shutdown flow only writes the durable `resume_pending` marker after `_drain_active_agents(timeout)` returns `timed_out=True`. Service managers such as systemd can terminate the process while it is still inside that drain wait, before the timeout branch runs.
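The ordering described above can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `drain_active_agents`, `shutdown_post_timeout_only`, and the in-memory `markers` set are hypothetical stand-ins, and cancelling the shutdown task is only an approximation of a service manager killing the process.

```python
import asyncio


async def drain_active_agents(tasks, timeout):
    """Hypothetical drain: wait for agent tasks, return True on timeout."""
    if not tasks:
        return False
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    return bool(pending)


async def shutdown_post_timeout_only(markers, running, timeout):
    """Marks sessions only after the drain wait reports a timeout."""
    timed_out = await drain_active_agents(set(running.values()), timeout)
    # Vulnerable window: nothing has been recorded until the await above returns.
    if timed_out:
        markers.update(running)  # adds the session ids (dict keys)
    return timed_out


async def simulate_kill_during_drain():
    markers = set()  # stand-in for the durable resume_pending store
    agent = asyncio.create_task(asyncio.sleep(60))
    shutdown = asyncio.create_task(
        shutdown_post_timeout_only(markers, {"s1": agent}, timeout=30)
    )
    await asyncio.sleep(0.01)  # let shutdown enter the drain wait
    shutdown.cancel()          # stand-in for the process being terminated
    agent.cancel()
    await asyncio.gather(shutdown, agent, return_exceptions=True)
    return markers
```

In this sketch the marker set is still empty after the simulated kill, because the timeout branch never ran.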

If the session's `updated_at` is older than the startup `suspend_recently_active()` fallback window, the next gateway boot has no durable marker to resume or suspend the in-flight session cleanly.

## Steps to Reproduce

1. Start a gateway session from a messaging platform.
2. Run a task long enough that its `SessionEntry.updated_at` falls outside the startup fallback window while the turn is still in progress.
3. Restart the gateway via the service manager while the task is still active.
4. Let the service manager terminate the old process while it is still inside the drain wait.
5. Start the gateway again and send a message in the same chat/thread/topic.

## Expected Behavior

The active session is durably marked before the vulnerable drain wait begins, so the next process can recover the interrupted session state instead of treating it as a normal idle session.

## Actual Behavior

The durable `resume_pending` marker may never be written if the old process is killed during the drain wait. Long-running sessions outside the startup freshness window can then appear stopped or reset after restart.

## Notes

This is in the same failure-mode family as the existing restart resume work, but the race happens earlier than the post-timeout `mark_resume_pending()` path: the process can die before that branch gets control.

A narrow fix is to pre-mark currently running sessions before awaiting drain, then clear only those early markers if the drain completes gracefully.
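A rough sketch of that pre-mark approach, under stated assumptions: `SessionStore`, `clear_resume_pending`, `drain_active_agents`, and `shutdown` are hypothetical names used for illustration (only `mark_resume_pending` and the drain/timeout shape come from the description above), and the store here is in-memory rather than durable.

```python
import asyncio


class SessionStore:
    """Hypothetical in-memory stand-in for the durable session store."""

    def __init__(self):
        self.resume_pending = set()

    def mark_resume_pending(self, session_id):
        self.resume_pending.add(session_id)

    def clear_resume_pending(self, session_id):
        self.resume_pending.discard(session_id)


async def drain_active_agents(tasks, timeout):
    """Wait for running agent tasks; return True if the wait timed out."""
    if not tasks:
        return False
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    return bool(pending)


async def shutdown(store, running, timeout):
    """running maps session_id -> asyncio.Task for in-flight agents."""
    # Pre-mark before the drain wait, remembering which markers are ours
    # so a graceful drain does not erase markers written elsewhere.
    early = {sid for sid in running if sid not in store.resume_pending}
    for sid in early:
        store.mark_resume_pending(sid)

    timed_out = await drain_active_agents(set(running.values()), timeout)

    if not timed_out:
        # Graceful drain: clear only the early markers set above.
        for sid in early:
            store.clear_resume_pending(sid)
    return timed_out
```

Tracking the early markers separately is what keeps the fix narrow: a marker that existed before shutdown began survives a graceful drain, and if the process dies mid-drain every running session is already marked.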

