fix(gateway): fire typed session_end on shutdown/restart for active sessions#80673
Conversation
a8eb12a to
0aeadb6
Compare
|
@clawsweeper re-review |
|
Codex review: needs real behavior proof before merge. Summary Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path. Real behavior proof Next step before merge Review detailsBest possible solution: Retry the Codex review after fixing the execution failure. Do we have a high-confidence way to reproduce the issue? Unclear. The review failed before ClawSweeper could establish a reproduction path. Is this the best way to solve the issue? Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction. What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 80d1df6e6830. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
Fix the two P1 findings the bot raised: - Await each session_end handler during the shutdown drain so the bounded total-timeout races real plugin work instead of just the synchronous for-loop. The drain now inlines the runSessionEnd call (uses the same buildSessionEndHookPayload + resolveStableSessionEndTranscript helpers the centralized emitter uses) and surfaces a hung plugin handler as timedOut=true. New unit cases prove both that the drain waits for an in-flight handler to settle and that a never-resolving handler is cut off at the configured budget. - Plumb the active-session tracker through the channel reply path (`src/auto-reply/reply/session.ts`) and the compaction lifecycle helper (`src/auto-reply/reply/session-updates.ts`), which previously called `hookRunner.runSessionStart/End` directly without going through `emitGatewaySessionStartPluginHook`. Now every public session lifecycle emitter notes / forgets the same tracker, so normal channel sessions and compacted sessions are both drained on shutdown and never double-fired by a subsequent drain after a normal close. Refs openclaw#57790.
d741dcd to
44e0d21
Compare
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…essions (openclaw#57790) `session_end` was only fired when a session was replaced, reset, deleted, or compacted -- the gateway shutdown/restart paths closed the process without enumerating active sessions, so downstream `session_end` plugins (e.g. claude-mem) accumulated ghost rows in `active` state across restarts. Issue reporter saw 11 orphaned sessions cause 63 timeouts/day from agent pool exhaustion. Add an in-memory active-session tracker (`src/gateway/active-sessions-shutdown-tracker.ts`) populated by `emitGatewaySessionStartPluginHook` and forgotten unconditionally by `emitGatewaySessionEndPluginHook` (even when no plugin listens), so any session that has already been finalized through the normal lifecycle is never re-fired by the shutdown drain. The close handler then calls a new `drainActiveSessionsForShutdown({ reason })` in `session-reset-service.ts` between the `gateway:shutdown`/`gateway:pre-restart` lifecycle hooks and the subsystem teardown steps; the drain races a bounded 2 s total timeout so a slow plugin cannot block SIGTERM/SIGINT, surfacing the timeout as a `session-end-drain` warning on the shutdown result. Extend `PluginHookSessionEndReason` with `"shutdown"` and `"restart"` so plugins can distinguish a graceful close from a planned restart; the close handler picks `restart` when `restartExpectedMs` is set and `shutdown` otherwise. Update `emitGatewaySessionStartPluginHook` to also accept `storePath`, `sessionFile`, and `agentId` so the shutdown drain can build the same `session_end` payload shape the normal lifecycle path emits, and update the existing call sites in `session-reset-service.ts` and `server-methods/sessions.ts` to pass those fields through. Tests: - `src/gateway/active-sessions-shutdown-tracker.test.ts` (new) -- tracker insert/forget/clear semantics, idempotent re-noting, empty-id guard, snapshot isolation. - `src/gateway/drain-active-sessions-for-shutdown.test.ts` (new) -- drain fires `session_end` with the right reason for every tracked session, skips sessions already finalized via reset/delete/compaction, and still forgets sessions even when no `session_end` plugin is registered. - `src/gateway/server-close.test.ts` -- four new cases covering the shutdown/restart drain wiring, the bounded timeout warning, and the drain-skipped-when-no-helper case. Docs: - `docs/plugins/hooks.md` documents the new `shutdown`/`restart` values on `PluginHookSessionEndReason`. - `docs/automation/hooks.md` documents the post-`gateway:shutdown` `session_end` drain step and its bounded execution guarantee. Fixes openclaw#57790.
…el + compaction lifecycle paths (openclaw#57790) Tighten the shutdown finalizer so it actually waits for plugin handlers under its bounded budget and so it covers every session lifecycle path, not just the centralized emitters in `session-reset-service.ts`. - `drainActiveSessionsForShutdown` previously called `emitGatewaySessionEndPluginHook`, which fires `runSessionEnd` as fire-and-forget (`void hookRunner.runSessionEnd(...)`). The bounded 2 s timeout then raced only the synchronous for-loop, so the close handler could proceed to subsystem teardown while a database-writing `session_end` plugin was still in flight -- the exact ghost-session failure this PR is supposed to fix. Inline the emit path: build the `buildSessionEndHookPayload` + `resolveStableSessionEndTranscript` payload directly in the drain and `await hookRunner.runSessionEnd(...)` under the bounded race. A never-resolving handler now surfaces as `timedOut=true` and the close handler records `session-end-drain` as a warning, but is never blocked. - The channel reply path in `src/auto-reply/reply/session.ts` and the compaction lifecycle helper in `src/auto-reply/reply/session-updates.ts` emit `session_start` / `session_end` directly through the global hook runner without going through `emitGatewaySessionStartPluginHook`, so the shutdown tracker never saw normal channel sessions or rolled-over compacted sessions. Wire the tracker `note` / `forget` calls into both paths so every public lifecycle emitter participates in the same tracker, and so a compacted session is both forgotten (previous id) and re-noted (new id) on rollover. Tests: - `src/gateway/drain-active-sessions-for-shutdown.test.ts` gains two cases: one proves the drain genuinely waits for an in-flight handler to settle before returning, the other proves a never-resolving handler is cut off at the configured budget with `timedOut=true`. Refs openclaw#57790.
6ff13ee to
0e0a555
Compare
|
Landed via rebase onto Verification:
SHAs:
Thanks @pandadev66! |
Fixes #57790.
session_endwas only fired when a session was replaced, reset, deleted, or compacted -- the gateway shutdown/restart paths closed the process without enumerating active sessions, so downstreamsession_endplugins (e.g. claude-mem) accumulated ghost rows inactivestate across restarts. Issue reporter saw 11 orphaned sessions cause 63 timeouts/day from agent pool exhaustion.ClawSweeper triage explicitly cleared this as "Queue a focused fix PR because the affected path, compatibility boundary, likely files, and regression tests are clear" with reproducibility confirmed at the source level.
Change
src/gateway/active-sessions-shutdown-tracker.ts) keyed bysessionId. Populated byemitGatewaySessionStartPluginHookand forgotten unconditionally byemitGatewaySessionEndPluginHook(even when no plugin currently listens), so any session that has already been finalized through replace / reset / delete / compaction is never re-fired by the shutdown drain.drainActiveSessionsForShutdown({ reason })insrc/gateway/session-reset-service.ts. Iterates the tracker, fires the sameemitGatewaySessionEndPluginHookthe normal lifecycle path uses, and races a bounded 2 s total timeout so a slow plugin cannot block SIGTERM/SIGINT.createGatewayCloseHandlerwires the drain between the existinggateway:shutdown/gateway:pre-restartlifecycle hooks and the subsystem teardown steps. Reason resolves torestartwhenrestartExpectedMsis set, otherwiseshutdown. Drain timeout surfaces on the shutdown result as asession-end-drainwarning so operators can spot a hanging plugin without losing the process.PluginHookSessionEndReasongains"shutdown"and"restart".emitGatewaySessionStartPluginHooknow also acceptsstorePath,sessionFile, andagentIdso the shutdown drain can build the samesession_endpayload shape the normal lifecycle emits; the two existing call sites (server-methods/sessions.tscreate path andsession-reset-service.tsreset path) pass those through.Docs:
docs/plugins/hooks.mddocuments the newshutdown/restartreason values.docs/automation/hooks.mddocuments the post-gateway:shutdowndrain and its bounded execution.Verification
pnpm test src/gateway/active-sessions-shutdown-tracker.test.ts src/gateway/drain-active-sessions-for-shutdown.test.ts src/gateway/server-close.test.ts-- 35/35 pass (19 new tracker + drain cases, plus 4 new close-handler cases for the shutdown/restart drain wiring).pnpm test src/gateway/server.sessions.reset-hooks.test.ts src/gateway/server.sessions.delete-lifecycle.test.ts src/cli/gateway-cli/run-loop.test.ts-- all clawsweeper-listed acceptance suites green.pnpm exec oxfmt --check --threads=1clean on every touched file.Real behavior proof
Behavior addressed:
session_endplugin hooks (claude-mem and the rest of the documented publicsession_endcontract) firing for every session that was active when the gateway process is stopped or restarted, with reason"shutdown"or"restart", and never double-firing for a session that was already finalized through replace / reset / delete / compaction. Issue reporter on #57790 cited 11 orphaned sessions causing 63 timeout errors in a single day from agent pool exhaustion when claude-mem never received the matchingsession_end.Real environment tested: local Linux 6.17 (Node 22.22.2). Live Node process with the real production helpers from
src/gateway/session-reset-service.ts--emitGatewaySessionStartPluginHook,emitGatewaySessionEndPluginHook, and the newdrainActiveSessionsForShutdown. The plugin host-hook runner is installed via the documentedSymbol.for("openclaw.plugins.hook-runner-global-state")singleton (the same hook the bundled plugin loader uses at gateway startup); the test runner is a tiny in-process listener that records everysession_endevent the gateway actually emits. No vitest mocks, no fake timers.Exact steps or command run after this patch:
Where
/tmp/proof-57790.mtsis a small driver that installs a claude-mem-shaped listener on the global hook-runner singleton, callsemitGatewaySessionStartPluginHookfor three real sessions (Telegram / Discord / Slack-style), closes one of them normally viaemitGatewaySessionEndPluginHook(reason: "reset"), then triggersdrainActiveSessionsForShutdown({ reason: "shutdown" })to simulate the SIGTERM path.Evidence after fix: terminal output captured directly from the live Node process. Each
session_endline is the event payload the listener received from the production emit helpers (not a mock).Observed result after fix:
session_startnow also receive a matchingsession_end:sess-B-002with reasonreset(normal close),sess-A-001andsess-C-003with reasonshutdown(shutdown finalizer). Before this PR the shutdown path emitted nothing, sosess-Aandsess-Cwould have stayed inactivestate until claude-mem (or the operator's band-aid cron) noticed the ghost rows.sess-B-002is not re-emitted with reasonshutdowneven though it receivedsession_start. The tracker forgot it the moment its normalsession_endfired, so the shutdown drain has nothing left to do for it. This is exactly the "must avoid double-firing for sessions already finalized by reset, delete, compaction, or replacement" constraint clawsweeper listed.emitted=0,captured=0. The tracker is empty, so a second SIGTERM cycle on the same gateway would not double-fire either.What was not tested: an end-to-end SIGTERM against a fully booted
openclaw gatewayprocess driving a real claude-mem plugin install -- the runtime proof above exercises the production emit helpers, the production global hook-runner singleton, and the new bounded drain, but uses an in-process listener instead of a separate plugin process. The unit suite (active-sessions-shutdown-tracker.test.ts,drain-active-sessions-for-shutdown.test.ts,server-close.test.ts) covers the close-handler wiring, the per-reason fan-out, and the timeout-warning path; the existingserver.sessions.reset-hooks.test.tsandserver.sessions.delete-lifecycle.test.tscontinue to pass and prove the normal lifecycle still emitssession_endcorrectly so the tracker's forget semantics stay correct.