Skip to content

fix(gateway): fire typed session_end on shutdown/restart for active sessions#80673

Merged
steipete merged 5 commits into
openclaw:mainfrom
pandadev66:fix/session-end-on-gateway-shutdown
May 11, 2026
Merged

fix(gateway): fire typed session_end on shutdown/restart for active sessions#80673
steipete merged 5 commits into
openclaw:mainfrom
pandadev66:fix/session-end-on-gateway-shutdown

Conversation

@pandadev66

Copy link
Copy Markdown
Contributor

Fixes #57790.

session_end was only fired when a session was replaced, reset, deleted, or compacted -- the gateway shutdown/restart paths closed the process without enumerating active sessions, so downstream session_end plugins (e.g. claude-mem) accumulated ghost rows in active state across restarts. Issue reporter saw 11 orphaned sessions cause 63 timeouts/day from agent pool exhaustion.

ClawSweeper triage explicitly cleared this as "Queue a focused fix PR because the affected path, compatibility boundary, likely files, and regression tests are clear" with reproducibility confirmed at the source level.

Change

  • New in-memory tracker (src/gateway/active-sessions-shutdown-tracker.ts) keyed by sessionId. Populated by emitGatewaySessionStartPluginHook and forgotten unconditionally by emitGatewaySessionEndPluginHook (even when no plugin currently listens), so any session that has already been finalized through replace / reset / delete / compaction is never re-fired by the shutdown drain.
  • New drainActiveSessionsForShutdown({ reason }) in src/gateway/session-reset-service.ts. Iterates the tracker, fires the same emitGatewaySessionEndPluginHook the normal lifecycle path uses, and races a bounded 2 s total timeout so a slow plugin cannot block SIGTERM/SIGINT.
  • createGatewayCloseHandler wires the drain between the existing gateway:shutdown/gateway:pre-restart lifecycle hooks and the subsystem teardown steps. Reason resolves to restart when restartExpectedMs is set, otherwise shutdown. Drain timeout surfaces on the shutdown result as a session-end-drain warning so operators can spot a hanging plugin without losing the process.
  • PluginHookSessionEndReason gains "shutdown" and "restart". emitGatewaySessionStartPluginHook now also accepts storePath, sessionFile, and agentId so the shutdown drain can build the same session_end payload shape the normal lifecycle emits; the two existing call sites (server-methods/sessions.ts create path and session-reset-service.ts reset path) pass those through.

Docs:

  • docs/plugins/hooks.md documents the new shutdown/restart reason values.
  • docs/automation/hooks.md documents the post-gateway:shutdown drain and its bounded execution.

Verification

  • pnpm test src/gateway/active-sessions-shutdown-tracker.test.ts src/gateway/drain-active-sessions-for-shutdown.test.ts src/gateway/server-close.test.ts -- 35/35 pass (19 new tracker + drain cases, plus 4 new close-handler cases for the shutdown/restart drain wiring).
  • pnpm test src/gateway/server.sessions.reset-hooks.test.ts src/gateway/server.sessions.delete-lifecycle.test.ts src/cli/gateway-cli/run-loop.test.ts -- all clawsweeper-listed acceptance suites green.
  • pnpm exec oxfmt --check --threads=1 clean on every touched file.

Real behavior proof

Behavior addressed: session_end plugin hooks (claude-mem and the rest of the documented public session_end contract) firing for every session that was active when the gateway process is stopped or restarted, with reason "shutdown" or "restart", and never double-firing for a session that was already finalized through replace / reset / delete / compaction. Issue reporter on #57790 cited 11 orphaned sessions causing 63 timeout errors in a single day from agent pool exhaustion when claude-mem never received the matching session_end.

Real environment tested: local Linux 6.17 (Node 22.22.2). Live Node process with the real production helpers from src/gateway/session-reset-service.ts -- emitGatewaySessionStartPluginHook, emitGatewaySessionEndPluginHook, and the new drainActiveSessionsForShutdown. The plugin host-hook runner is installed via the documented Symbol.for("openclaw.plugins.hook-runner-global-state") singleton (the same hook the bundled plugin loader uses at gateway startup); the test runner is a tiny in-process listener that records every session_end event the gateway actually emits. No vitest mocks, no fake timers.

Exact steps or command run after this patch:

node --import tsx /tmp/proof-57790.mts

Where /tmp/proof-57790.mts is a small driver that installs a claude-mem-shaped listener on the global hook-runner singleton, calls emitGatewaySessionStartPluginHook for three real sessions (Telegram / Discord / Slack-style), closes one of them normally via emitGatewaySessionEndPluginHook(reason: "reset"), then triggers drainActiveSessionsForShutdown({ reason: "shutdown" }) to simulate the SIGTERM path.

Evidence after fix: terminal output captured directly from the live Node process. Each session_end line is the event payload the listener received from the production emit helpers (not a mock).

--- simulating claude-mem-style plugin attached to session_end ---
noted 3 active sessions: sess-A-001, sess-B-002, sess-C-003

--- one session is closed normally via reset before shutdown ---
captured 1 session_end event(s) so far
  sessionId=sess-B-002 reason=reset

--- gateway receives SIGTERM, shutdown finalizer runs ---
drain returned: {"emittedSessionIds":["sess-A-001","sess-C-003"],"timedOut":false} elapsedMs=1

--- session_end events received by plugin during full lifecycle ---
  [0] sessionId=sess-B-002 sessionKey=agent:main:discord:efgh reason=reset
  [1] sessionId=sess-A-001 sessionKey=agent:main:telegram:abcd reason=shutdown
  [2] sessionId=sess-C-003 sessionKey=agent:main:slack:ijkl reason=shutdown

--- replaying drain proves the tracker is now empty (no double-fire) ---
second drain: emitted=0 timedOut=false captured=0

Observed result after fix:

  • All three sessions that received session_start now also receive a matching session_end: sess-B-002 with reason reset (normal close), sess-A-001 and sess-C-003 with reason shutdown (shutdown finalizer). Before this PR the shutdown path emitted nothing, so sess-A and sess-C would have stayed in active state until claude-mem (or the operator's band-aid cron) noticed the ghost rows.
  • sess-B-002 is not re-emitted with reason shutdown even though it received session_start. The tracker forgot it the moment its normal session_end fired, so the shutdown drain has nothing left to do for it. This is exactly the "must avoid double-firing for sessions already finalized by reset, delete, compaction, or replacement" constraint clawsweeper listed.
  • The drain completed in 1 ms wall time, well under the 2 s bounded budget; even with three sessions the finalizer never approaches the SIGTERM shutdown deadline.
  • Replaying the drain immediately afterwards is a clean no-op: emitted=0, captured=0. The tracker is empty, so a second SIGTERM cycle on the same gateway would not double-fire either.

What was not tested: an end-to-end SIGTERM against a fully booted openclaw gateway process driving a real claude-mem plugin install -- the runtime proof above exercises the production emit helpers, the production global hook-runner singleton, and the new bounded drain, but uses an in-process listener instead of a separate plugin process. The unit suite (active-sessions-shutdown-tracker.test.ts, drain-active-sessions-for-shutdown.test.ts, server-close.test.ts) covers the close-handler wiring, the per-reason fan-out, and the timeout-warning path; the existing server.sessions.reset-hooks.test.ts and server.sessions.delete-lifecycle.test.ts continue to pass and prove the normal lifecycle still emits session_end correctly so the tracker's forget semantics stay correct.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime size: L proof: supplied External PR includes structured after-fix real behavior proof. labels May 11, 2026
@pandadev66 pandadev66 force-pushed the fix/session-end-on-gateway-shutdown branch from a8eb12a to 0aeadb6 Compare May 11, 2026 14:28
@pandadev66

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Real behavior proof
Not applicable: Real behavior proof was not assessed because the Codex review failed.

Next step before merge
Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

What I checked:

  • failure reason: timeout.
  • codex failure detail: Codex review failed for this PR: spawnSync codex ETIMEDOUT.
  • codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

Remaining risk / open question:

  • No close action taken because the review did not complete.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 80d1df6e6830.

@clawsweeper

clawsweeper Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
pandadev66 added a commit to pandadev66/openclaw that referenced this pull request May 11, 2026
Fix the two P1 findings the bot raised:

- Await each session_end handler during the shutdown drain so the bounded
  total-timeout races real plugin work instead of just the synchronous
  for-loop. The drain now inlines the runSessionEnd call (uses the same
  buildSessionEndHookPayload + resolveStableSessionEndTranscript helpers
  the centralized emitter uses) and surfaces a hung plugin handler as
  timedOut=true. New unit cases prove both that the drain waits for an
  in-flight handler to settle and that a never-resolving handler is cut
  off at the configured budget.

- Plumb the active-session tracker through the channel reply path
  (`src/auto-reply/reply/session.ts`) and the compaction lifecycle helper
  (`src/auto-reply/reply/session-updates.ts`), which previously called
  `hookRunner.runSessionStart/End` directly without going through
  `emitGatewaySessionStartPluginHook`. Now every public session lifecycle
  emitter notes / forgets the same tracker, so normal channel sessions
  and compacted sessions are both drained on shutdown and never
  double-fired by a subsequent drain after a normal close.

Refs openclaw#57790.
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@pandadev66 pandadev66 force-pushed the fix/session-end-on-gateway-shutdown branch from d741dcd to 44e0d21 Compare May 11, 2026 15:04
@pandadev66

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
pandadev66 and others added 5 commits May 11, 2026 17:18
…essions (openclaw#57790)

`session_end` was only fired when a session was replaced, reset, deleted, or
compacted -- the gateway shutdown/restart paths closed the process without
enumerating active sessions, so downstream `session_end` plugins
(e.g. claude-mem) accumulated ghost rows in `active` state across restarts.
Issue reporter saw 11 orphaned sessions cause 63 timeouts/day from agent
pool exhaustion.

Add an in-memory active-session tracker
(`src/gateway/active-sessions-shutdown-tracker.ts`) populated by
`emitGatewaySessionStartPluginHook` and forgotten unconditionally by
`emitGatewaySessionEndPluginHook` (even when no plugin listens), so any
session that has already been finalized through the normal lifecycle is
never re-fired by the shutdown drain. The close handler then calls a new
`drainActiveSessionsForShutdown({ reason })` in `session-reset-service.ts`
between the `gateway:shutdown`/`gateway:pre-restart` lifecycle hooks and
the subsystem teardown steps; the drain races a bounded 2 s total timeout
so a slow plugin cannot block SIGTERM/SIGINT, surfacing the timeout as a
`session-end-drain` warning on the shutdown result.

Extend `PluginHookSessionEndReason` with `"shutdown"` and `"restart"` so
plugins can distinguish a graceful close from a planned restart; the close
handler picks `restart` when `restartExpectedMs` is set and `shutdown`
otherwise. Update `emitGatewaySessionStartPluginHook` to also accept
`storePath`, `sessionFile`, and `agentId` so the shutdown drain can build
the same `session_end` payload shape the normal lifecycle path emits, and
update the existing call sites in `session-reset-service.ts` and
`server-methods/sessions.ts` to pass those fields through.

Tests:

- `src/gateway/active-sessions-shutdown-tracker.test.ts` (new) -- tracker
  insert/forget/clear semantics, idempotent re-noting, empty-id guard,
  snapshot isolation.
- `src/gateway/drain-active-sessions-for-shutdown.test.ts` (new) -- drain
  fires `session_end` with the right reason for every tracked session,
  skips sessions already finalized via reset/delete/compaction, and still
  forgets sessions even when no `session_end` plugin is registered.
- `src/gateway/server-close.test.ts` -- four new cases covering the
  shutdown/restart drain wiring, the bounded timeout warning, and the
  drain-skipped-when-no-helper case.

Docs:

- `docs/plugins/hooks.md` documents the new `shutdown`/`restart` values
  on `PluginHookSessionEndReason`.
- `docs/automation/hooks.md` documents the post-`gateway:shutdown`
  `session_end` drain step and its bounded execution guarantee.

Fixes openclaw#57790.
…el + compaction lifecycle paths (openclaw#57790)

Tighten the shutdown finalizer so it actually waits for plugin handlers
under its bounded budget and so it covers every session lifecycle path,
not just the centralized emitters in `session-reset-service.ts`.

- `drainActiveSessionsForShutdown` previously called
  `emitGatewaySessionEndPluginHook`, which fires `runSessionEnd` as
  fire-and-forget (`void hookRunner.runSessionEnd(...)`). The bounded
  2 s timeout then raced only the synchronous for-loop, so the close
  handler could proceed to subsystem teardown while a database-writing
  `session_end` plugin was still in flight -- the exact ghost-session
  failure this PR is supposed to fix. Inline the emit path: build the
  `buildSessionEndHookPayload` + `resolveStableSessionEndTranscript`
  payload directly in the drain and `await hookRunner.runSessionEnd(...)`
  under the bounded race. A never-resolving handler now surfaces as
  `timedOut=true` and the close handler records `session-end-drain` as
  a warning, but is never blocked.
- The channel reply path in `src/auto-reply/reply/session.ts` and the
  compaction lifecycle helper in `src/auto-reply/reply/session-updates.ts`
  emit `session_start` / `session_end` directly through the global hook
  runner without going through `emitGatewaySessionStartPluginHook`, so
  the shutdown tracker never saw normal channel sessions or rolled-over
  compacted sessions. Wire the tracker `note` / `forget` calls into both
  paths so every public lifecycle emitter participates in the same
  tracker, and so a compacted session is both forgotten (previous id)
  and re-noted (new id) on rollover.

Tests:

- `src/gateway/drain-active-sessions-for-shutdown.test.ts` gains two
  cases: one proves the drain genuinely waits for an in-flight handler
  to settle before returning, the other proves a never-resolving handler
  is cut off at the configured budget with `timedOut=true`.

Refs openclaw#57790.
@steipete steipete force-pushed the fix/session-end-on-gateway-shutdown branch from 6ff13ee to 0e0a555 Compare May 11, 2026 16:21
@steipete steipete merged commit af0b775 into openclaw:main May 11, 2026
85 of 87 checks passed
@steipete

Copy link
Copy Markdown
Contributor

Landed via rebase onto main.

Verification:

  • pnpm test src/gateway/active-sessions-shutdown-tracker.test.ts src/gateway/drain-active-sessions-for-shutdown.test.ts src/gateway/server-close.test.ts src/gateway/server.sessions.reset-hooks.test.ts src/gateway/server.sessions.delete-lifecycle.test.ts src/cli/gateway-cli/run-loop.test.ts
  • pnpm build:strict-smoke
  • GitHub required checks accepted the PR at merge; lightweight proof/checks on the rebased head were green, with non-required broad shards still queued.

SHAs:

  • Rebased PR head: 0e0a55525fc27905b76f00ffc83bccb8ce0dbaee
  • Merge commit: af0b775274838238d59f16bb7e577b1bfa35aaae

Thanks @pandadev66!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation gateway Gateway runtime proof: supplied External PR includes structured after-fix real behavior proof. size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway shutdown/restart does not fire session_end for active sessions

2 participants