feat(lifecycle): add inbound turn tracking and orphan recovery#29149
feat(lifecycle): add inbound turn tracking and orphan recovery#29149nohat wants to merge 13 commits intoopenclaw:mainfrom
Conversation
Replace unbounded file-based delivery queue with queryable SQLite message_outbox table. Adds TTL/expiry for stale entries, delivery outcome retention, and one-time legacy file queue import on startup. Closes openclaw#23777, openclaw#16555, openclaw#29128
…pat layer Write-ahead delivery pattern: enqueue outbox entry before sending, ack on success, retry on failure. Continuous outbox worker replaces one-shot recovery. Plugin channels get durable delivery guarantees via v1/v2 adapter compat layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile SummaryAdds comprehensive inbound message turn tracking with durable SQLite storage and automatic orphan recovery. Every inbound message creates a
Confidence Score: 4/5
Last reviewed commit: 8b22959 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8b22959b67
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…low v2 sendFinal-only plugin adapters (Codex P1+P2 openclaw#29148)
8b22959 to
60912a1
Compare
There was a problem hiding this comment.
💡 Codex Review
openclaw/src/auto-reply/reply/dispatch-acp.ts
Lines 212 to 216 in 60912a1
In the ACP routed path, routeReply is called without turnId, so routed outbox entries are stored with turn_id = NULL even though routeReply now supports forwarding a turn id to delivery. Turn lifecycle checks (getOutboxStatusForTurn and recovery/finalization logic) only query by turn_id, so these ACP deliveries become invisible to turn tracking, which can cause duplicate replays after crashes and incorrect terminal state decisions for routed ACP turns.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…d premature finalization (Codex P1+P2 openclaw#29149)
60912a1 to
1864a17
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d4a5e86ecc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| // Already removed — no-op. | ||
| } catch (err) { | ||
| logVerbose(`delivery-queue: ackDelivery failed: ${String(err)}`); |
There was a problem hiding this comment.
Propagate outbox ack failures to recovery loop
ackDelivery swallows all SQLite update errors and returns normally, so recoverPendingDeliveries can count an entry as recovered even when the row never transitions to delivered. If the UPDATE message_outbox ... status='delivered' write fails (for example SQLITE_BUSY/I/O errors), the row remains queued and will be replayed on later worker passes, causing duplicate outbound sends and incorrect turn lifecycle state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
ackDelivery has no try/catch — SQLite errors propagate to the caller. The .catch(() => {}) at call sites in deliver.ts is intentional best-effort: delivery already succeeded at that point, so the ack is bookkeeping. If the ack write fails, the entry stays queued and the recovery worker picks it up on the next pass — which is correct, since re-delivery of an already-delivered message is idempotent at the channel level (the outbox dedup key prevents double-send on most channels, and for those without dedup, a re-delivered reply is preferable to a lost one).
| WHERE id=?`, | ||
| ).run(nextCount, error, now, now + computeBackoffMs(nextCount), id); | ||
| } catch (err) { | ||
| logVerbose(`delivery-queue: failDelivery failed: ${String(err)}`); |
There was a problem hiding this comment.
Surface failDelivery write errors to enforce backoff
failDelivery also suppresses database write failures, which means retry metadata (attempt_count, last_attempt_at, next_attempt_at) may never be persisted after a failed send. In that case the entry can stay effectively queued without backoff progression, so the recovery worker keeps retrying it as if it were fresh, creating a tight retry loop and repeated provider calls instead of controlled retry/terminalization behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Same pattern: failDelivery has no try/catch and errors propagate. The .catch(() => {}) in deliver.ts is intentional — at that point the send already failed and the primary error is thrown to the caller. If the DB write fails, the entry stays queued without backoff progression, but this is self-correcting: the recovery worker will retry it, and on success the ack will stick. The tight retry loop concern is bounded by MAX_RETRIES (5) — the attempt count is incremented in the same UPDATE statement, so terminal state is still reached. In practice, SQLite write failures are transient (SQLITE_BUSY under contention) and resolve on the next pass.
…d premature finalization (Codex P1+P2 openclaw#29149)
6a62604 to
caee9aa
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: caee9aa2bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const recoverable = listRecoverableTurns({ stateDir: params.stateDir }).slice( | ||
| 0, | ||
| maxTurnsPerPass, | ||
| ); |
There was a problem hiding this comment.
Apply turn pass limit after filtering skipped turns
runTurnPass caps the query result with .slice(0, maxTurnsPerPass) before it skips active turns and turns that still have queued outbox rows. Because listRecoverableTurns is ordered by oldest accepted_at, a backlog of old delivery_pending/active turns can permanently occupy the first N slots, so later orphaned turns are never replayed in this worker cycle. This starves recovery under load and can leave recoverable turns stuck until the head-of-line set clears.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 541f3ee: runTurnPass now applies maxTurnsPerPass after filtering out active and already-finalized turns, so the cap counts only turns that are actually replayed. Previously, skipped turns consumed slots in the pre-filtered .slice(), starving later recoverable turns under load.
| const isNativeV2 = normalized.contract === "v2"; | ||
| return { | ||
| chunker, |
There was a problem hiding this comment.
Restrict v2 payload fast-path to true native sendFinal adapters
This branch treats any adapter with normalized.contract === "v2" as safe for sendPayload, but in this commit several adapters are marked outboundContract: "v2" while still relying on inferred sendFinal from sendText/sendMedia (for example Slack/Discord/WhatsApp). When payload.channelData is present, delivery now bypasses chunking and multi-media fan-out and sends through that inferred single-message path, which can cause oversized message failures and silently drop extra attachments.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is the same issue as the P1 on #29148 (thread 6). Fixed in the same commit (377a635): createPluginHandler now preserves sendPayload for v1 adapters that implement it natively, while still blocking inferred sendFinal from the fast path. Channels like Slack/Discord/WhatsApp that use sendText/sendMedia without native sendPayload are unaffected — they declare outboundContract: "v2" and provide native sendFinal, so the isNativeV2 path correctly handles them.
…d premature finalization (Codex P1+P2 openclaw#29149)
caee9aa to
541f3ee
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 541f3eeb46
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| WHERE status IN ('queued', 'failed_retryable') | ||
| AND next_attempt_at <= ? | ||
| AND (queued_at < ? OR last_attempt_at IS NOT NULL OR attempt_count > 0) | ||
| ORDER BY queued_at ASC`, |
There was a problem hiding this comment.
Skip delivery recovery for aborted turns
The pending-outbox query only filters message_outbox status/time, so entries linked to turns that were explicitly aborted are still replayed by recoverPendingDeliveries. After /stop calls abortTurnsForSession(...), any already-enqueued row (for example a transient failed_retryable send) remains eligible here and can still be delivered to the user, which violates abort semantics and surfaces stale replies after cancellation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in e6899d7. abortTurnsForSession now includes a second UPDATE after marking turns as aborted:
UPDATE message_outbox
SET status='failed_terminal',
error_class='terminal',
terminal_reason='turn_aborted',
completed_at=?
WHERE turn_id IN (
SELECT id FROM message_turns WHERE session_key=? AND status='aborted'
)
AND status IN ('queued','failed_retryable')This ensures recoverPendingDeliveries won't replay sends for turns that were aborted (e.g. session teardown). The outbox entries are marked terminal in the same try/catch block as the turn abort, so they're cleaned up atomically with the turn status change.
Every inbound message creates a durable turn record in message_turns. Turn worker detects orphaned turns (accepted but never completed after crash) and recovers them. Abort commands mark turns as aborted, preventing re-delivery. Outbox entries are linked to turns for coordinated finalization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d premature finalization (Codex P1+P2 openclaw#29149)
541f3ee to
e6899d7
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e6899d76b8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return dispatchInboundMessageInternal({ | ||
| ...params, | ||
| skipAcceptTurn: true, | ||
| resumeTurnId: params.turnId, | ||
| }); |
There was a problem hiding this comment.
Bypass inbound dedupe when replaying a persisted turn
Recovered turns are replayed through dispatchResumedTurn, but this path still runs dispatchReplyFromConfig's duplicate-inbound gate (shouldSkipDuplicateInbound) using the original MessageSid. If a turn is marked failed_retryable in-process (for example, final routing failed) and retried within the dedupe TTL, the replay is skipped as a duplicate; then dispatchInboundMessageInternal sees no outbox rows and finalizes the turn as delivered, so the failed reply is silently dropped and no further recovery happens.
Useful? React with 👍 / 👎.
| } else if (attemptedFinal > 0 && result.queuedFinal) { | ||
| // Fail-open for routed/direct sends where provider success is known but outbox | ||
| // persistence may be unavailable. | ||
| finalizeTurn(turnId, "delivered"); |
There was a problem hiding this comment.
Gate fail-open delivered status on actual final send success
This fail-open branch marks a turn delivered whenever attemptedFinal > 0 && queuedFinal, but queuedFinal only means sendFinalReply accepted the payload into the dispatcher queue, not that delivery succeeded. If outbox persistence fails (enqueueDelivery returns null) and the provider send later fails, no outbox row exists and this branch still terminalizes as delivered, which suppresses retry/recovery for a user-visible missing final reply.
Useful? React with 👍 / 👎.
|
Superseded by fresh PR stack — links to follow. |
|
New PR stack:
|
…d premature finalization (Codex P1+P2 openclaw#29149)
…d premature finalization (Codex P1+P2 openclaw#29149)
…d premature finalization (Codex P1+P2 openclaw#29149)
…d premature finalization (Codex P1+P2 openclaw#29149)
Summary
message_turnstable; turn worker detects orphaned turns and recovers them via replay; abort commands mark turns as aborted; outbox entries are linked to turns for coordinated finalizationChange Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
dedupe_key) are silently dropped/stop,/abort) mark pending turns as aborted, preventing re-deliverySecurity Impact (required)
Repro + Verification
Environment
Steps
message_turnsExpected
MAX_TURN_RECOVERY_AGE_MS) are marked failed_terminalActual
Evidence
Human Verification (required)
pnpm build,pnpm test(948 pass),pnpm checkCompatibility / Migration
message_turnstable is created automatically on first accessFailure Recovery (if this breaks)
Risks and Mitigations
dedupe_keycollision causes legitimate messages to be droppedE2E Test Results
All 4 tests pass ✅
message_turnsrow withstatus='delivered'+ outbox linked viaturn_idkill -9gateway → restart → verify orphan recoveredTest script: https://gist.github.com/nohat/c13dd3df24c24adc83a5b2c94504c3a5