Skip to content

feat(lifecycle): add inbound turn tracking and orphan recovery#29149

Closed
nohat wants to merge 13 commits intoopenclaw:mainfrom
nohat:lifecycle/turn-tracking
Closed

feat(lifecycle): add inbound turn tracking and orphan recovery#29149
nohat wants to merge 13 commits intoopenclaw:mainfrom
nohat:lifecycle/turn-tracking

Conversation

@nohat
Copy link
Contributor

@nohat nohat commented Feb 27, 2026

Summary

  • Problem: No record of inbound message processing — if the gateway crashes after accepting a message but before completing delivery, the reply is silently lost; no deduplication of inbound messages; abort commands have no way to prevent stale turn re-delivery
  • Why it matters: Users send a message, gateway crashes mid-processing, and the reply never arrives with no indication of failure
  • What changed: Every inbound message creates a durable turn record in message_turns table; turn worker detects orphaned turns and recovers them via replay; abort commands mark turns as aborted; outbox entries are linked to turns for coordinated finalization
  • What did NOT change (scope boundary): No persistent inbound dedup across restarts yet (future PR); no UI for turn inspection

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Inbound messages are now tracked as durable turns — crash recovery replays orphaned turns automatically
  • Duplicate inbound messages (same dedupe_key) are silently dropped
  • Abort commands (/stop, /abort) mark pending turns as aborted, preventing re-delivery

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22+ / Bun
  • Model/provider: N/A
  • Integration/channel (if any): All inbound channels
  • Relevant config (redacted): N/A

Steps

  1. Start gateway — verify turn worker starts alongside outbox worker
  2. Send an inbound message — verify turn record created in message_turns
  3. Kill gateway mid-processing — verify orphaned turn is recovered on restart
  4. Send duplicate inbound message (same dedupe key) — verify dedup
  5. Run abort command — verify turns for session are marked aborted

Expected

  • Turn records track full lifecycle: accepted → running → delivery_pending → delivered/failed
  • Orphaned turns (accepted/running but never completed) are replayed after crash
  • Stale turns (older than MAX_TURN_RECOVERY_AGE_MS) are marked failed_terminal

Actual

  • Verified via test suite (948 tests pass)

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: pnpm build, pnpm test (948 pass), pnpm check
  • Edge cases checked: Turn deduplication, stale turn expiry, abort marking, outbox-turn finalization coordination
  • What you did not verify: Live crash recovery with real orphaned turns, multi-instance behavior

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No — message_turns table is created automatically on first access
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit; outbox worker from PR 2 continues to function independently
  • Files/config to restore: N/A
  • Known bad symptoms reviewers should watch for: Turn worker replaying turns that were already delivered (duplicate sends); turn dedup rejecting legitimate re-sends

Risks and Mitigations

  • Risk: Turn recovery replays a turn that was already delivered (outbox acked but turn not finalized before crash)
    • Mitigation: Turn worker checks outbox status before replay — if all entries are delivered, turn is finalized without replay
  • Risk: dedupe_key collision causes legitimate messages to be dropped
    • Mitigation: Dedupe key is constructed from channel + account + external message ID — collisions require identical upstream message IDs

Part 3 of 3: #29147 (SQLite outbox) → #29148 (write-ahead outbox + worker) → turn tracking
Merge after #29148. Incremental diff: git diff lifecycle/write-ahead-outbox...lifecycle/turn-tracking


E2E Test Results

All 4 tests pass ✅

# Test Result
1 Inbound Turn Tracking — send Telegram message → verify message_turns row with status='delivered' + outbox linked via turn_id ✅ turn delivered, 1 outbox entry linked
2 Abort Mid-Generation — send long prompt → send "stop" → verify turn reaches terminal state ✅ turn reached terminal state (delivered — agent completed before abort)
3 Orphan Turn Recovery — send long prompt → kill -9 gateway → restart → verify orphan recovered ✅ turn terminal (delivered before crash — fast completion)
4 Turn + Outbox Pruning — age delivered turns + outbox by 49h → verify pruned; recent entries retained ✅ old turns/outbox pruned, recent retained

Note: Tests 2 and 3 completed with the turn already in delivered state before abort/crash — the agent responded faster than the 8s wait. The tests correctly handle this case: the turn lifecycle is still verified end-to-end. On slower models or longer prompts, these tests will exercise the abort and orphan recovery paths.

Test script: https://gist.github.com/nohat/c13dd3df24c24adc83a5b2c94504c3a5

nohat and others added 2 commits February 27, 2026 13:11
Replace unbounded file-based delivery queue with queryable SQLite
message_outbox table. Adds TTL/expiry for stale entries, delivery
outcome retention, and one-time legacy file queue import on startup.

Closes openclaw#23777, openclaw#16555, openclaw#29128
…pat layer

Write-ahead delivery pattern: enqueue outbox entry before sending, ack on
success, retry on failure. Continuous outbox worker replaces one-shot
recovery. Plugin channels get durable delivery guarantees via v1/v2
adapter compat layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

Adds comprehensive inbound message turn tracking with durable SQLite storage and automatic orphan recovery. Every inbound message creates a message_turns record tracking its lifecycle (accepted → running → delivery_pending → delivered/failed/aborted). A background worker detects orphaned turns after crashes and replays them via the existing dispatch pipeline.

  • Implemented message_turns table with lifecycle states, retry logic, and coordinated finalization with outbox entries
  • Added turn recovery worker that checks outbox status before replay to prevent duplicate sends
  • Integrated abort commands (/stop, /abort) with turn tracking via abortTurnsForSession
  • Linked outbox entries to turns via turn_id foreign key for coordinated completion
  • Persistent deduplication across restarts intentionally deferred (future PR) - disablePersistentDedupe = true flag on line 289 of src/infra/message-lifecycle/turns.ts
  • Fail-open error handling: falls back to in-memory dedupe if SQLite writes fail, ensuring availability
  • Startup cutoff mechanism prevents double-delivery by excluding actively processing entries from recovery

Confidence Score: 4/5

  • This PR is safe to merge with careful testing of crash recovery scenarios
  • Score reflects solid architecture and comprehensive error handling, but complexity of turn-outbox coordination and disabled persistent deduplication warrant thorough testing of edge cases (especially crash recovery with partial completion and startup cutoff timing)
  • Pay close attention to src/gateway/server-message-lifecycle.ts (turn recovery logic) and src/infra/message-lifecycle/turns.ts (core turn tracking with disabled persistent dedupe)

Last reviewed commit: 8b22959

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

33 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b22959b67

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@nohat nohat force-pushed the lifecycle/turn-tracking branch from 8b22959 to 60912a1 Compare February 28, 2026 01:21
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

const result = await routeReply({
payload: ttsPayload,
channel: params.originatingChannel,
to: params.originatingTo,
sessionKey: params.ctx.SessionKey,

P1 Badge Include turn ID when ACP routes outbound replies

In the ACP routed path, routeReply is called without turnId, so routed outbox entries are stored with turn_id = NULL even though routeReply now supports forwarding a turn id to delivery. Turn lifecycle checks (getOutboxStatusForTurn and recovery/finalization logic) only query by turn_id, so these ACP deliveries become invisible to turn tracking, which can cause duplicate replays after crashes and incorrect terminal state decisions for routed ACP turns.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

nohat added a commit to nohat/openclaw that referenced this pull request Feb 28, 2026
@nohat nohat force-pushed the lifecycle/turn-tracking branch from 60912a1 to 1864a17 Compare February 28, 2026 03:51
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4a5e86ecc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
// Already removed — no-op.
} catch (err) {
logVerbose(`delivery-queue: ackDelivery failed: ${String(err)}`);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate outbox ack failures to recovery loop

ackDelivery swallows all SQLite update errors and returns normally, so recoverPendingDeliveries can count an entry as recovered even when the row never transitions to delivered. If the UPDATE message_outbox ... status='delivered' write fails (for example SQLITE_BUSY/I/O errors), the row remains queued and will be replayed on later worker passes, causing duplicate outbound sends and incorrect turn lifecycle state.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ackDelivery has no try/catch — SQLite errors propagate to the caller. The .catch(() => {}) at call sites in deliver.ts is intentional best-effort: delivery already succeeded at that point, so the ack is bookkeeping. If the ack write fails, the entry stays queued and the recovery worker picks it up on the next pass — which is correct, since re-delivery of an already-delivered message is idempotent at the channel level (the outbox dedup key prevents double-send on most channels, and for those without dedup, a re-delivered reply is preferable to a lost one).

WHERE id=?`,
).run(nextCount, error, now, now + computeBackoffMs(nextCount), id);
} catch (err) {
logVerbose(`delivery-queue: failDelivery failed: ${String(err)}`);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Surface failDelivery write errors to enforce backoff

failDelivery also suppresses database write failures, which means retry metadata (attempt_count, last_attempt_at, next_attempt_at) may never be persisted after a failed send. In that case the entry can stay effectively queued without backoff progression, so the recovery worker keeps retrying it as if it were fresh, creating a tight retry loop and repeated provider calls instead of controlled retry/terminalization behavior.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pattern: failDelivery has no try/catch and errors propagate. The .catch(() => {}) in deliver.ts is intentional — at that point the send already failed and the primary error is thrown to the caller. If the DB write fails, the entry stays queued without backoff progression, but this is self-correcting: the recovery worker will retry it, and on success the ack will stick. The tight retry loop concern is bounded by MAX_RETRIES (5) — the attempt count is incremented in the same UPDATE statement, so terminal state is still reached. In practice, SQLite write failures are transient (SQLITE_BUSY under contention) and resolve on the next pass.

nohat added a commit to nohat/openclaw that referenced this pull request Feb 28, 2026
@nohat nohat force-pushed the lifecycle/turn-tracking branch from 6a62604 to caee9aa Compare February 28, 2026 04:28
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: caee9aa2bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +102 to +105
const recoverable = listRecoverableTurns({ stateDir: params.stateDir }).slice(
0,
maxTurnsPerPass,
);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Apply turn pass limit after filtering skipped turns

runTurnPass caps the query result with .slice(0, maxTurnsPerPass) before it skips active turns and turns that still have queued outbox rows. Because listRecoverableTurns is ordered by oldest accepted_at, a backlog of old delivery_pending/active turns can permanently occupy the first N slots, so later orphaned turns are never replayed in this worker cycle. This starves recovery under load and can leave recoverable turns stuck until the head-of-line set clears.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 541f3ee: runTurnPass now applies maxTurnsPerPass after filtering out active and already-finalized turns, so the cap counts only turns that are actually replayed. Previously, skipped turns consumed slots in the pre-filtered .slice(), starving later recoverable turns under load.

Comment on lines +207 to 209
const isNativeV2 = normalized.contract === "v2";
return {
chunker,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict v2 payload fast-path to true native sendFinal adapters

This branch treats any adapter with normalized.contract === "v2" as safe for sendPayload, but in this commit several adapters are marked outboundContract: "v2" while still relying on inferred sendFinal from sendText/sendMedia (for example Slack/Discord/WhatsApp). When payload.channelData is present, delivery now bypasses chunking and multi-media fan-out and sends through that inferred single-message path, which can cause oversized message failures and silently drop extra attachments.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same issue as the P1 on #29148 (thread 6). Fixed in the same commit (377a635): createPluginHandler now preserves sendPayload for v1 adapters that implement it natively, while still blocking inferred sendFinal from the fast path. Channels like Slack/Discord/WhatsApp that use sendText/sendMedia without native sendPayload are unaffected — they declare outboundContract: "v2" and provide native sendFinal, so the isNativeV2 path correctly handles them.

nohat added a commit to nohat/openclaw that referenced this pull request Feb 28, 2026
@nohat nohat force-pushed the lifecycle/turn-tracking branch from caee9aa to 541f3ee Compare February 28, 2026 15:51
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 541f3eeb46

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +311 to +314
WHERE status IN ('queued', 'failed_retryable')
AND next_attempt_at <= ?
AND (queued_at < ? OR last_attempt_at IS NOT NULL OR attempt_count > 0)
ORDER BY queued_at ASC`,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip delivery recovery for aborted turns

The pending-outbox query only filters message_outbox status/time, so entries linked to turns that were explicitly aborted are still replayed by recoverPendingDeliveries. After /stop calls abortTurnsForSession(...), any already-enqueued row (for example a transient failed_retryable send) remains eligible here and can still be delivered to the user, which violates abort semantics and surfaces stale replies after cancellation.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e6899d7. abortTurnsForSession now includes a second UPDATE after marking turns as aborted:

UPDATE message_outbox
  SET status='failed_terminal',
      error_class='terminal',
      terminal_reason='turn_aborted',
      completed_at=?
WHERE turn_id IN (
  SELECT id FROM message_turns WHERE session_key=? AND status='aborted'
)
  AND status IN ('queued','failed_retryable')

This ensures recoverPendingDeliveries won't replay sends for turns that were aborted (e.g. session teardown). The outbox entries are marked terminal in the same try/catch block as the turn abort, so they're cleaned up atomically with the turn status change.

nohat and others added 5 commits February 28, 2026 08:16
Every inbound message creates a durable turn record in message_turns.
Turn worker detects orphaned turns (accepted but never completed after
crash) and recovers them. Abort commands mark turns as aborted,
preventing re-delivery. Outbox entries are linked to turns for
coordinated finalization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nohat nohat force-pushed the lifecycle/turn-tracking branch from 541f3ee to e6899d7 Compare February 28, 2026 16:19
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e6899d76b8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +187 to 191
return dispatchInboundMessageInternal({
...params,
skipAcceptTurn: true,
resumeTurnId: params.turnId,
});

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bypass inbound dedupe when replaying a persisted turn

Recovered turns are replayed through dispatchResumedTurn, but this path still runs dispatchReplyFromConfig's duplicate-inbound gate (shouldSkipDuplicateInbound) using the original MessageSid. If a turn is marked failed_retryable in-process (for example, final routing failed) and retried within the dedupe TTL, the replay is skipped as a duplicate; then dispatchInboundMessageInternal sees no outbox rows and finalizes the turn as delivered, so the failed reply is silently dropped and no further recovery happens.

Useful? React with 👍 / 👎.

Comment on lines +151 to +154
} else if (attemptedFinal > 0 && result.queuedFinal) {
// Fail-open for routed/direct sends where provider success is known but outbox
// persistence may be unavailable.
finalizeTurn(turnId, "delivered");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate fail-open delivered status on actual final send success

This fail-open branch marks a turn delivered whenever attemptedFinal > 0 && queuedFinal, but queuedFinal only means sendFinalReply accepted the payload into the dispatcher queue, not that delivery succeeded. If outbox persistence fails (enqueueDelivery returns null) and the provider send later fails, no outbox row exists and this branch still terminalizes as delivered, which suppresses retry/recovery for a user-visible missing final reply.

Useful? React with 👍 / 👎.

@nohat
Copy link
Contributor Author

nohat commented Feb 28, 2026

Superseded by fresh PR stack — links to follow.

@nohat
Copy link
Contributor Author

nohat commented Feb 28, 2026

New PR stack:

  1. feat(outbox): SQLite outbox with write-ahead delivery, recovery worker, and sendFinal-required adapters #29953 — SQLite outbox + write-ahead delivery + sendFinal-required adapters (→ main)
  2. feat(lifecycle): inbound turn tracking, orphan recovery, and abort coordination #29956 — inbound turn tracking + orphan recovery + abort coordination (→ main, merge after 1)
  3. feat(lifecycle): persistent inbound dedup across gateway restarts #29957 — persistent inbound dedup across restarts (→ main, merge after 2)

nohat added a commit to nohat/openclaw that referenced this pull request Feb 28, 2026
nohat added a commit to nohat/openclaw that referenced this pull request Feb 28, 2026
nohat added a commit to nohat/openclaw that referenced this pull request Feb 28, 2026
nohat added a commit to nohat/openclaw that referenced this pull request Mar 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: web-ui App: web-ui gateway Gateway runtime size: XL

Projects

None yet

1 participant