fix(gateway): restart sentinel wakes the session after restart and preserves thread routing#53940
Conversation
Greptile SummaryThis PR fixes a regression where the restart sentinel could deliver a restart note but fail to wake the interrupted session, causing the in-progress session to stall after gateway restart. The fix unconditionally calls Key changes:
One minor observation: In Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/gateway/server-restart-sentinel.ts
Line: 106-109
Comment:
**`payload.threadId` priority inverted vs outbound delivery path**
`mergeDeliveryContext(primary, fallback)` gives `payload.deliveryContext` priority, so if `payload.deliveryContext` already carries a `threadId`, `payload.threadId` is silently discarded. However the outbound delivery path at lines 156–160 does the opposite — `payload.threadId` wins via `??`:
```ts
const threadId =
payload.threadId ??
parsedTarget?.threadId ??
sessionThreadId ??
(origin?.threadId != null ? String(origin.threadId) : undefined);
```
In the edge case where both fields are present and differ, the wake-triggered heartbeat reply would land in a different thread than the outbound restart notice. Consider swapping the argument order to match the outbound priority:
```suggestion
const wakeDeliveryContext = mergeDeliveryContext(
payload.threadId != null ? { ...payload.deliveryContext, threadId: payload.threadId } : payload.deliveryContext,
undefined,
);
```
Or more simply:
```ts
const wakeDeliveryContext = normalizeDeliveryContext({
...payload.deliveryContext,
...(payload.threadId != null ? { threadId: payload.threadId } : {}),
});
```
This makes the priority consistent: `payload.threadId` (the top-level sentinel field) wins over `payload.deliveryContext.threadId`, matching the outbound path.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: preserve turnSource threadId in hea..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3b8599b924
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d62919f08
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const normalizedDeliveryContext = normalizeDeliveryContext(options?.deliveryContext); | ||
| entry.lastContextKey = normalizedContextKey; | ||
| if (entry.lastText === cleaned) { | ||
| return false; |
There was a problem hiding this comment.
Include delivery context in duplicate suppression key
enqueueSystemEvent now stores deliveryContext, but the duplicate check still only compares lastText. If the same system message is queued again with a different route (for example, same restart text but a new thread/topic), the second enqueue is dropped and the old route remains in the queue; resolveSystemEventDeliveryContext will then resolve stale routing and can send the wake-triggered reply to the wrong destination. Dedup should account for normalized delivery context (or update the queued event’s context) before returning false.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I think this is directionally right as a generic system-event hardening point, but I am not taking it in this PR. The route-bearing enqueue introduced here is the restart-sentinel path, and that sentinel is consumed once on startup rather than being re-enqueued repeatedly with changing delivery context. For this branch I kept the fix scoped to the concrete wake regressions; if maintainers want, I can follow up separately on broader queue dedup semantics.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c66b621e40
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 033d9c35ed
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // heartbeat mode intentionally drops inherited thread IDs to avoid replying | ||
| // in stale threads (e.g., Slack thread_ts). The sentinel's delivery context | ||
| // carries the correct topic/thread ID when present. | ||
| turnSourceThreadId: params.turnSource?.threadId, |
There was a problem hiding this comment.
Gate turn-source thread IDs to target=last heartbeats
resolveHeartbeatDeliveryTarget now always passes turnSourceThreadId even when heartbeat.target is an explicit channel (not last). In that case resolveSessionDeliveryTarget can still attach the queued event's thread ID if the explicit channel matches the session's last channel, so an explicitly configured destination (e.g., fixed Telegram/Slack chat) can be forced into an unrelated old thread/topic after a restart wake. This is a regression from prior behavior (heartbeat mode dropped inherited thread IDs) and can misroute production heartbeat replies whenever a queued system event carries threadId.
Useful? React with 👍 / 👎.
Previously the restart sentinel note was delivered as an outbound message with bestEffort:true, which silently swallowed failures. Even on success, it never triggered an agent turn so the agent couldn't act on the note. Changes: - Always enqueue system event AND request heartbeat wake for the session - Switch outbound delivery to bestEffort:false with retry (2 attempts) - Log warnings on delivery failures instead of swallowing silently - Add dedicated logger for restart-sentinel subsystem
… routing Previously heartbeat mode dropped ALL threadIds, including the restart sentinel's delivery context threadId. This meant sentinel-triggered agent turns couldn't route replies back to Telegram forum topics. Changes: - resolveSessionDeliveryTarget now preserves turnSourceThreadId in heartbeat mode (but still drops session-inherited threadIds) - resolveHeartbeatDeliveryTarget passes only explicit turnSource threadId, not merged session threadId - Added tests for turn-source thread routing in heartbeat mode
033d9c3 to
958155a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 958155acb4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // heartbeat mode intentionally drops inherited thread IDs to avoid replying | ||
| // in stale threads (e.g., Slack thread_ts). The sentinel's delivery context | ||
| // carries the correct topic/thread ID when present. | ||
| turnSourceThreadId: params.turnSource?.threadId, |
There was a problem hiding this comment.
Skip turn-source thread IDs for non-last heartbeat targets
resolveHeartbeatDeliveryTarget now always passes params.turnSource?.threadId into resolveSessionDeliveryTarget, even when heartbeat.target is an explicit channel (not last). In that explicit-target case, if the configured channel matches the session’s lastChannel, heartbeat mode will still attach this thread/topic ID, so stale system-event routing metadata can force replies into an unrelated thread. This is a routing regression for explicit heartbeat destinations, which previously did not inherit thread IDs in heartbeat mode.
Useful? React with 👍 / 👎.
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
partial) - Add deliveryContext field to SystemEvent for routing preservation - Wake interrupted session via heartbeat after restart (requestHeartbeatNow) - Add retry logic for outbound delivery (2 attempts with 750ms delay) - Preserve threadId routing through wake path - Always enqueue wake even when delivery fails Partial port of upstream 1c9f62f: - Core: system-events deliveryContext, restart sentinel wake - Deferred: heartbeat-runner turnSource integration, targets.ts routing updates (complex changes, requires more analysis) Upstream commit: - 1c9f62f: fix(gateway): restart sentinel wakes session after restart (openclaw#53940) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* port: before_dispatch hook delivery semantics (upstream a10d587, b497f3c) - Add before_dispatch hook allowing plugins to intercept messages before model dispatch - Extract sendFinalPayload helper to unify TTS + routing logic - Preserve delivery semantics (TTS, routed delivery) when hook handles message - Use canonical hook metadata (normalized conversation id, sender id, channel) Upstream commits: - a10d587: fix: preserve before_dispatch delivery semantics (openclaw#50444) - b497f3c: fix: normalize before_dispatch conversation id Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * port: gateway restart sentinel wake + delivery context (upstream 1c9f62f partial) - Add deliveryContext field to SystemEvent for routing preservation - Wake interrupted session via heartbeat after restart (requestHeartbeatNow) - Add retry logic for outbound delivery (2 attempts with 750ms delay) - Preserve threadId routing through wake path - Always enqueue wake even when delivery fails Partial port of upstream 1c9f62f: - Core: system-events deliveryContext, restart sentinel wake - Deferred: heartbeat-runner turnSource integration, targets.ts routing updates (complex changes, requires more analysis) Upstream commit: - 1c9f62f: fix(gateway): restart sentinel wakes session after restart (openclaw#53940) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * port: prefer freshest duplicate session rows (upstream f48571b, 40f820f) - Add resolveFreshestSessionStoreMatchFromStoreKeys to prefer newest updatedAt - Add resolveFreshestSessionEntryFromStoreKeys wrapper - Use in sessions.preview for duplicate row handling Handles case-insensitive and legacy alias keys by sorting duplicates by updatedAt timestamp and returning the freshest entry. Upstream commits: - f48571b: fix: prefer freshest duplicate rows in session loads - 40f820f: fix: prefer freshest duplicate session rows in reads Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix: remove delivery-queue dependency (not yet ported) Remove enqueueDelivery/ackDelivery/failDelivery usage from server-restart-sentinel.ts as delivery-queue.ts is not yet ported to our codebase. Keep retry logic but use agentCommand directly. Resolves build failure. * port: isolate channel startup failures (upstream 30e80fb) Wrap individual channel startup in try-catch so one broken channel (e.g. WhatsApp missing runtime) doesn't block other channels (e.g. Discord) from starting. Changes: - Add try-catch in startChannels loop - Log per-channel startup errors - Continue starting remaining channels after failure Upstream commit: - 30e80fb: fix: isolate channel startup failures (openclaw#54215) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix: implement runBeforeDispatch without runClaimingHook - Implement runBeforeDispatch directly using getHooksForName - Remove GroupName check (not in our MsgContext) - Fix TypeScript build errors Resolves build failures after before_dispatch hook port. * port: per-model cooldown scope + stepped backoff (upstream 8440122) Scope rate-limit cooldowns per model so one 429 no longer blocks every model on the same auth profile. Replace exponential 1min→1h escalation with stepped 30s/1min/5min ladder. Key changes: - Add cooldownReason and cooldownModel to ProfileUsageStats type - Implement stepped backoff: 30s → 1min → 5min (was: 1min → 5min → 25min → 60min) - Add model-aware cooldown bypass in isProfileInCooldown() - Track model scope when marking auth profile failures - Pass modelId through markAuthProfileFailure and related calls - Update isProfileInCooldown calls in model-fallback and pi-embedded-runner to pass forModel Upstream ref: 8440122 * port: surface mid-turn 429 rate limits (upstream 4ae4d1f partial) Surface rate limit and overload errors that occur mid-turn (after tool calls) instead of silently returning an empty response. Only applies when the assistant produced no valid (non-error) reply text, so tool-level rate-limit messages don't override a successful turn. Changes: - Add isReasoning field to EmbeddedPiRunResult payload type - Detect mid-turn rate limits in agent-runner-execution.ts when there's no valid content (checking for text/media, excluding errors/reasoning) - Import isRateLimitErrorMessage and isOverloadedErrorMessage - Replace empty responses with user-facing rate limit message Note: Skipped upstream's incomplete turn detection in run.ts (detecting stopReason=toolUse with no payloads) as it requires deep understanding of our specific agent loop structure and could cause false positives. The agent-runner-execution.ts check catches the issue at final output. Upstream ref: 4ae4d1f (partial port) * port: isolate session:patch hook payload (upstream 765182d, 3e2e9bc partial) Changes: - Add hasInternalHookListeners() to check for listeners before cloning - Add session:patch hook with structuredClone to isolate payload - Add SessionPatchHookEvent and isSessionPatchEvent() type guard - Only clone and dispatch when listeners are registered (performance) Why structuredClone: Fire-and-forget hooks cannot mutate objects used by the response path. Deep cloning sessionEntry, patch, and cfg prevents plugin corruption. Skip model default reasoning guards (6c04ce3, b91374e): Those patches require agentEntry.reasoningDefault and modelState.resolveDefaultReasoningLevel() which haven't been ported yet. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
…ves thread routing (openclaw#53940) thanks @VACInc Co-authored-by: VACInc <3279061+VACInc@users.noreply.github.com> Co-authored-by: Val Alexander <68980965+BunsDev@users.noreply.github.com>
Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause / Regression History (if applicable)
git blame, prior PR, issue, or refactor if known):refactor: unify outbound session context wiringplus earlier restart-sentinel routing/thread fixes.Regression Test Plan (if applicable)
src/gateway/server-restart-sentinel.test.ts,src/infra/heartbeat-runner.ghost-reminder.test.ts,src/infra/outbound/targets.test.ts,src/infra/system-events.test.tsUser-visible / Behavior Changes
Security Impact (required)
Yes/No): NoYes/No): NoYes/No): NoYes/No): NoYes/No): NoYes, explain risk + mitigation:Repro + Verification
Environment
agents.defaults.heartbeat.target=lastSteps
Expected
Actual
Evidence
Attach at least one:
Human Verification (required)
pnpm buildpasses.Review Conversations
Compatibility / Migration
Yes/No): YesYes/No): NoYes/No): NoFailure Recovery (if this breaks)
src/gateway/server-restart-sentinel.ts,src/infra/system-events.ts,src/infra/outbound/targets.ts,src/infra/heartbeat-runner.tsRisks and Mitigations