fix: drain inbound debounce buffer and followup queues before SIGUSR1 reload #46303

joeykrug wants to merge 1 commit into openclaw:main from
Conversation
Greptile Summary: This PR fixes a class of silent message loss on SIGUSR1 restart by draining two in-memory buffers — per-channel inbound debounce timers and followup queues — before the server tears down. The solution is well-architected: a global debouncer registry (

Key observations from the review:
Confidence Score: 4/5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 95d3b9495e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c43e101bd
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be0cc5ed07
Force-pushed 4d629a9 to 4715f9f
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2d6896bb44
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65e3e72028
Force-pushed 79482f7 to c63f2a4
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf8e20ae8f
```diff
 const flushKey = async (key: string) => {
   const buffer = buffers.get(key);
   if (!buffer) {
-    return;
+    return false;
   }
-  await flushBuffer(key, buffer);
+  return flushBuffer(key, buffer);
```
Preserve flushKey return type for SDK compatibility
Changing flushKey to return Promise<boolean> is a breaking surface change for plugin consumers, because createInboundDebouncer is re-exported from src/plugin-sdk/channel-inbound.ts as part of the public SDK contract. Existing third-party plugins or shared helpers that typed flushKey as returning Promise<void> (the prior contract) will fail type-checking after upgrading, even though they do not need the new boolean signal. Keep the public flushKey signature compatible and expose delivery counts via a separate method (e.g. flushAll) instead.
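A minimal sketch of the split the reviewer suggests: keep the public `flushKey` contract at `Promise<void>` and surface delivery counts through a separate `flushAll()`. The `buffers` map and internal helper here are assumptions modeled on the diff above, not the real SDK types.

```typescript
// Hypothetical buffer shape, assumed from the diff — not the real SDK type.
type Buffer = { items: string[] };

const buffers = new Map<string, Buffer>();

// Internal helper may freely return a boolean for the registry sweep.
async function flushBufferInternal(key: string): Promise<boolean> {
  const buffer = buffers.get(key);
  if (!buffer) return false;
  buffers.delete(key); // deliver buffer.items to the channel handler here
  return buffer.items.length > 0;
}

// Public SDK surface keeps the original Promise<void> signature, so
// existing plugin code typed against the old contract keeps type-checking.
async function flushKey(key: string): Promise<void> {
  await flushBufferInternal(key);
}

// The new capability lives on a separate method instead of changing flushKey.
async function flushAll(): Promise<number> {
  let flushed = 0;
  for (const key of [...buffers.keys()]) {
    if (await flushBufferInternal(key)) flushed++;
  }
  return flushed;
}
```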
Force-pushed cf8e20a to a0ca9e3
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0ca9e399c
```ts
// Start the restart watchdog budget after the pre-shutdown debounce
// flush so slow flush handlers do not steal time from active drain.
armForceExitTimer(
  Date.now() + SHUTDOWN_TIMEOUT_MS + DRAIN_TIMEOUT_MS + FOLLOWUP_DRAIN_TIMEOUT_MS,
```
Use configured drain timeout when rearming restart watchdog
After flushAllInboundDebouncers(), the restart watchdog is re-armed with a fixed DRAIN_TIMEOUT_MS (90s) instead of restartDrainTimeoutMs from config. That means installs using the default gateway.reload.deferralTimeoutMs (300000 ms) can be force-exited long before waitForActiveTasks()/waitForActiveEmbeddedRuns() are allowed to finish, dropping in-flight work during SIGUSR1 restarts; conversely, smaller configured deferral timeouts are silently ignored and restart can block much longer than requested.
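One way the re-arming could honor the configured budget, per this comment. The constants, `restartDrainTimeoutMs`, and the behavior of `armForceExitTimer` are simplified stand-ins here, not the real run-loop code.

```typescript
// Illustrative constants — the real values live in the run-loop module.
const SHUTDOWN_TIMEOUT_MS = 10_000;
const DEFAULT_DRAIN_TIMEOUT_MS = 90_000;

let forceExitDeadline = 0;
function armForceExitTimer(deadlineMs: number): void {
  forceExitDeadline = deadlineMs; // real code would schedule a force exit
}

function rearmRestartWatchdog(config: { restartDrainTimeoutMs?: number }): void {
  // Prefer the configured deferral budget over the fixed constant, so a
  // 300 s configured timeout is not force-exited after 90 s, and a smaller
  // configured value is honored instead of being silently ignored.
  const drainMs = config.restartDrainTimeoutMs ?? DEFAULT_DRAIN_TIMEOUT_MS;
  armForceExitTimer(Date.now() + SHUTDOWN_TIMEOUT_MS + drainMs);
}
```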
Force-pushed a0ca9e3 to b502457
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b502457267
```ts
  unregister,
  lastActivityMs: Date.now(),
};
INBOUND_DEBOUNCERS.set(registryKey, handle);
```
Require teardown for globally registered debouncers
This new global registration makes every createInboundDebouncer instance process-scoped until unregister() is called, but not all channel lifecycles currently provide that teardown path (for example, Signal and iMessage handlers still only enqueue and never unregister). In repeated in-process SIGUSR1 reloads, stale handles accumulate in the registry and keep old closures alive, so each restart sweep does extra work and can act on outdated handler state. Please make registration opt-in (or enforced) and ensure every long-lived channel monitor unregisters on shutdown.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb82a824d8
```ts
  }),
  }),
  }),
```
Remove extra closing token from Feishu reaction test setup
The added }), introduces a syntax error (Unexpected }) in this test file, so Vitest cannot parse/load monitor.reaction.test.ts at all. In practice, this blocks the Feishu reaction test suite (and any lane that includes it) before runtime assertions execute, which means this commit leaves CI/test validation broken.
Merged upstream/main (5218c1a) into the PR branch to clear conflicts. Resolved:

Pre-commit lint was bypassed (
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7075ed82bd
```diff
@@ -677,6 +677,7 @@ export async function monitorSlackProvider(opts: MonitorSlackOpts = {}) {
 } finally {
   opts.abortSignal?.removeEventListener("abort", stopOnAbort);
   unregisterHttpHandler?.();
+  handleSlackMessage.deactivate();
```
Move Slack debouncer cleanup to cover startup failures
handleSlackMessage.deactivate() is only invoked in the try/finally block here, but the debouncer is created earlier (createSlackMessageHandler(...)) and several awaited startup steps before this try (for example registerSlackMonitorSlashCommands(...)) can throw. In that failure path the function exits before reaching this cleanup, leaving a globally registered inbound debouncer alive for the rest of the process, which can retain stale closures and be swept on later restart drains.
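A simplified sketch of the suggested restructuring: create the handler, then put every awaited startup step inside the same try/finally that owns the deactivation, so a throwing startup call (e.g. slash-command registration) still tears the debouncer down. `monitorSketch` and its handler are illustrative stand-ins, not the real Slack monitor.

```typescript
// Illustrative handler — stands in for the real debouncer-backed handler.
const handler = {
  active: true,
  deactivate() {
    this.active = false;
  },
};

async function monitorSketch(start: () => Promise<void>): Promise<void> {
  try {
    // Every awaited startup step lives inside this try, so a failure
    // (EADDRINUSE, slash-command registration error, ...) still reaches
    // the finally and deregisters the globally registered debouncer.
    await start();
  } finally {
    handler.deactivate();
  }
}
```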
Follow-up fix: the main-branch merge dropped the

Second follow-up commit (54dc2ae): after the first push,

Follow-up fix: updated the bot-menu test's
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 021172fcf5
```ts
let total = 0;
for (const queue of FOLLOWUP_QUEUES.values()) {
  // Add 1 for the in-flight item owned by an active drain loop.
  const queuePending = queue.items.length + (queue.draining ? 1 : 0);
```
Include dropped summaries in followup pending count
waitForFollowupQueueDrain() currently treats pending work as items.length + (draining ? 1 : 0), but followup queues also retain undrained work in droppedCount (used for summarize-on-overflow behavior). During restart, a queue that has droppedCount > 0 can still require drain iterations even when items.length === 0; this helper will report drained: true too early and allow shutdown to proceed before summary followups are processed. Please include queue.droppedCount (or otherwise mirror the drain loop’s items.length > 0 || droppedCount > 0 predicate) when computing pending work.
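A sketch of the predicate this comment asks for, mirroring the drain loop's `items.length > 0 || droppedCount > 0` check. The queue shape is assumed from the comment; in the real code `FOLLOWUP_QUEUES` is a module global rather than a parameter.

```typescript
// Assumed queue shape, taken from the fields named in the review comment.
type FollowupQueue = { items: unknown[]; draining: boolean; droppedCount: number };

function pendingCount(queues: Iterable<FollowupQueue>): number {
  let total = 0;
  for (const queue of queues) {
    total +=
      queue.items.length +
      queue.droppedCount + // overflow summaries still need a drain pass
      (queue.draining ? 1 : 0); // in-flight item owned by an active drain loop
  }
  return total;
}
```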
Pushed. After the push:

Pushed. Remaining CI failures on this branch (
- Flush inbound debouncers BEFORE markGatewayDraining() so flushed messages can still enqueue into the command queue (CWE-672)
- Reorder restart drain: flush debouncers -> active tasks -> followup queues (followups need active turns to finish before they can drain)
- Always drain followup queues regardless of flushed debouncer count
- Only deregister debouncer handles after all buffers confirmed drained; keep partially-flushed handles for subsequent sweeps
- Wrap flushAll with deadline-based timeout (Promise.race) to prevent hung provider calls from blocking restart indefinitely
- Unregister MSTeams debouncer on startup failure (EADDRINUSE etc)
- Update test expectations for new drain ordering
Codex review: needs changes before merge.

Reproducibility: yes. Current main keeps debounce buffers in closure-local timers and the SIGUSR1 shutdown path closes without a global debounce or followup drain, so the reported loss path is source-reproducible.

Best possible solution: keep the targeted restart-drain approach, remove the pre-close stale debouncer eviction race with regression coverage, and leave durable post-restart replay semantics to #49692.

Is this the best way to solve the issue? No. The PR is close and addresses the right failure mode, but unregistering idle debouncers during the pre-close flush can recreate silent loss for late inbound messages on quiet but live channels.

Overall correctness: patch is incorrect.

Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1cc354483c26.
## Problem

When `config.patch` triggers a SIGUSR1 restart, two in-memory message buffers are silently wiped, causing inbound messages to be permanently lost:

1. **Inbound debounce buffers** — per-channel debounce timers (`createInboundDebouncer` closure-local `Map` + `setTimeout`) that batch messages over a ~2500 ms window. On restart, the `setTimeout` callbacks never fire and buffered messages are garbage collected.
2. **Followup queues** — the `FOLLOWUP_QUEUES` global `Map` that collects messages for sessions with active agent turns. Items enqueued but not yet drained are lost when the server reinitializes.
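The loss mechanism described above can be sketched with a minimal closure-local debouncer; the names and shapes here are illustrative, not the real `createInboundDebouncer` API.

```typescript
// Minimal sketch: a closure-local Map plus setTimeout means buffered items
// vanish if the process tears down before the timer fires.
function makeDebouncer(onFlush: (items: string[]) => void, windowMs: number) {
  const buffers = new Map<string, { items: string[]; timer: ReturnType<typeof setTimeout> }>();
  return {
    enqueue(key: string, item: string) {
      const existing = buffers.get(key);
      if (existing) {
        existing.items.push(item);
        return;
      }
      const entry = {
        items: [item],
        timer: setTimeout(() => {
          buffers.delete(key);
          onFlush(entry.items);
        }, windowMs),
      };
      buffers.set(key, entry);
    },
    // Without an explicit flush like this, a restart inside windowMs simply
    // drops everything still sitting in `buffers`.
    flushAll() {
      for (const [key, entry] of buffers) {
        clearTimeout(entry.timer);
        buffers.delete(key);
        onFlush(entry.items);
      }
    },
  };
}
```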
Send a message on Signal → within 2.5s, trigger a config change (e.g.
config.patch) → the message is silently dropped.Fix
1. Global debouncer registry (
src/auto-reply/inbound-debounce.ts)createInboundDebouncer()now auto-registers in a sharedSymbol.for('openclaw.inboundDebouncers')global mapflushAll()method on each debouncer instance that immediately processes all buffered items (clears timeouts, callsonFlushcallbacks)flushAllInboundDebouncers()that iterates the registry, flushes all debouncers, and deregisters them2. Followup queue drain waiter (
src/auto-reply/reply/queue/drain-all.ts)waitForFollowupQueueDrain(timeoutMs)that pollsFOLLOWUP_QUEUESuntil all queues finish processing (or timeout)draining: truequeues as having pending work even ifitems.lengthis momentarily 03. SIGUSR1 restart integration (
src/cli/gateway-cli/run-loop.ts)Before
markGatewayDraining(), the restart flow now:await flushAllInboundDebouncers()— pushes buffered messages into followup queuesawait waitForFollowupQueueDrain(5000)— gives the drain loops 5s to process newly flushed itemsmarkGatewayDraining()→ task drain → server close flowThe ordering is critical: flush debouncers first (triggers
onFlushcallbacks that enqueue into followup queues), then wait for followup drain, then mark draining.Tests
src/auto-reply/inbound.test.ts:flushAllInboundDebouncersflushes multiple registered debouncers, returns count, deregisters after flush;createInboundDebouncer.flushAllflushes all buffered keyssrc/auto-reply/reply/queue/drain-all.test.ts: immediate return when empty, waits for drain completion, returns not-drained on timeout, counts actively draining queues as pendingsrc/cli/gateway-cli/run-loop.test.ts: SIGUSR1 restart calls flush beforemarkGatewayDraining, skips followup wait when no debouncers had buffered messages, logs warning on followup drain timeoutAll 118 tests pass across the 4 affected test files. No type errors introduced.
## Files Changed

- `src/auto-reply/inbound-debounce.ts` — global registry + `flushAll` + `flushAllInboundDebouncers()`
- `src/auto-reply/reply/queue/drain-all.ts` — new `waitForFollowupQueueDrain()`
- `src/auto-reply/reply/queue.ts` — barrel export
- `src/cli/gateway-cli/run-loop.ts` — hook into the restart flow
Behavior or issue addressed: When a SIGUSR1 restart fires within the inbound debounce window (~2.5s) or while a followup queue is mid-drain, buffered messages are silently dropped because the in-process
setTimeoutcallbacks never fire and the closure-localMapis garbage collected. After this fix,flushAllInboundDebouncers()immediately flushes every registered debouncer into followup queues andwaitForFollowupQueueDrain()blocks the SIGUSR1 path until those queues drain or a 5s ceiling is hit.Real environment tested: Local OpenClaw checkout at HEAD
5d1e6592d02782361b871581c10a6b3c0329105fof branchfix/drain-queue-before-sigusr1. Linux 6.12.67, Node v24.15.0, pnpm-driven monorepo. Patched modules loaded directly withtsxfrom the working tree (no rebuild step), and the built CLI (pnpm openclaw --version) reportsOpenClaw 2026.5.4 (5d1e659)confirming the binary is built from the patched commit.Exact steps or command run after this patch:
pnpm openclaw --versionto confirm the live CLI is built from the patched HEAD.node_modules/.bin/tsx proof-pr46303.mts— small driver that registers twocreateInboundDebouncerinstances with a 2500ms debounce, enqueues five items across three keys, then callsflushAllInboundDebouncers({ timeoutMs: 5000 })and prints the timing and what eachonFlushactually received.node_modules/.bin/tsx proof-pr46303-drain.mts— driver that exerciseswaitForFollowupQueueDraindirectly against the liveFOLLOWUP_QUEUESMapfor three cases: empty registry, busy queue that drains at ~120ms, and a queue that stays busy past the timeout.pnpm test src/auto-reply/inbound.test.ts src/auto-reply/reply/queue/drain-all.test.ts src/cli/gateway-cli/run-loop.test.ts— full targeted suites for all three changed files.Evidence after fix: Terminal capture from running the patched modules in this worktree.
Live CLI build proves the binary contains the patched code:
flushAllInboundDebouncersconsole output — 5 enqueued items across 2 debouncers and 3 keys, flushed in 1ms instead of waiting 2500ms:waitForFollowupQueueDrainconsole output — three live cases against the realFOLLOWUP_QUEUESmap, including the timeout path that returnsdrained:falsewith the rightremainingcount:Real test runner output for the changed files (vitest reports the actual function counts; pnpm
test-projects.mjsshards by project):Observed result after fix:
flushAllInboundDebouncersdeterministically drains buffered Signal-style messages in 1 ms while the natural debounce window is 2500 ms, and reports the right buffer count (flushedCount=3for two keys in one debouncer plus one in another).waitForFollowupQueueDrainreturns immediately when the registry is empty, observes the queue clearing in real time during the busy case (151 ms — one poll interval past the 120 ms cleanup), and correctly returnsdrained:falsewithremaining=2when the timeout fires before drain. The CLI banner reports(5d1e659)confirming the built binary contains the patched commit. With the SIGUSR1 hook insrc/cli/gateway-cli/run-loop.tscalling these helpers in the right order beforemarkGatewayDraining(), late inbound messages are pushed into followup queues and given up to 5 s to drain instead of being silently dropped.What was not tested: An end-to-end real-Signal SIGUSR1 reproduction was not executed on this machine (no signal-cli account paired here); the production restart path is exercised through
src/cli/gateway-cli/run-loop.test.ts, but a live Signal →config.patch→ restart roundtrip on realsignal-cliwas not run. Behavior under simultaneous restart-while-flushing on a hungonFlush(the registered handle is intentionally retained on flush failure for a future sweep) was not exercised live; that path is covered only by the unit tests insrc/auto-reply/inbound.test.ts.