feat(lifecycle): inbound turn tracking, orphan recovery, and abort coordination #29956

Closed
nohat wants to merge 16 commits into openclaw:main from nohat:lifecycle/turn-tracking-temp

nohat (Contributor) commented Feb 28, 2026

Summary

Stack: merge after #29953 (the diff will be correct once #29953 is merged; until then this shows the combined diff)

Adds durable inbound turn tracking so the gateway can detect and recover orphaned turns (e.g., after a crash mid-stream).

Key differences from old #29149:

  • Active-turn registry prevents recovery races and premature finalization
  • Outbox entries cancelled for aborted turns
  • Dead minAgeMs parameter removed from listRecoverableTurns
  • Inbound dedupe bypassed for resumed turns
  • Fail-open gated on delivery stats (not unconditional)

Carries forward:

  • Durable turn records in SQLite (inbound_turns table)
  • Orphan recovery: turns older than threshold with no heartbeat are resumed
  • Inbound dedup: duplicate messages rejected via dedupe_key
  • Abort marking: cancelled turns have outbox entries cleaned up
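
The orphan-recovery and dedupe rules in the list above can be illustrated with a minimal in-memory sketch. It is not the PR's implementation: the PR stores turns in SQLite, and the names `TurnStore`, `accept`, `heartbeat`, `finish`, and `listRecoverable`, along with the field names, are hypothetical.

```typescript
// Hypothetical in-memory model of a turn record; the PR persists these in SQLite.
type TurnStatus = "accepted" | "completed" | "aborted";

interface TurnRecord {
  id: string;
  dedupeKey: string;
  status: TurnStatus;
  acceptedAt: number; // epoch ms
  lastHeartbeatAt: number; // epoch ms, refreshed while the turn is being processed
}

class TurnStore {
  private byId = new Map<string, TurnRecord>();
  private byDedupeKey = new Map<string, string>();

  // Returns null when a turn with the same dedupe key was already accepted.
  accept(id: string, dedupeKey: string, now: number): TurnRecord | null {
    if (this.byDedupeKey.has(dedupeKey)) return null;
    const turn: TurnRecord = {
      id,
      dedupeKey,
      status: "accepted",
      acceptedAt: now,
      lastHeartbeatAt: now,
    };
    this.byId.set(id, turn);
    this.byDedupeKey.set(dedupeKey, id);
    return turn;
  }

  // Only turns that are still in flight can refresh their heartbeat.
  heartbeat(id: string, now: number): void {
    const turn = this.byId.get(id);
    if (turn && turn.status === "accepted") turn.lastHeartbeatAt = now;
  }

  finish(id: string, status: "completed" | "aborted"): void {
    const turn = this.byId.get(id);
    if (turn) turn.status = status;
  }

  // Orphans: still "accepted" but silent for longer than the threshold.
  listRecoverable(now: number, staleAfterMs: number): TurnRecord[] {
    return Array.from(this.byId.values()).filter(
      (t) => t.status === "accepted" && now - t.lastHeartbeatAt > staleAfterMs,
    );
  }
}
```

In this sketch a completed or aborted turn never appears in `listRecoverable`, so only turns that stopped heartbeating without finishing are candidates for resumption.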

Closes #26764, #29124, #29125, #29127
Related: #28941

Test plan

  • pnpm build passes
  • pnpm test passes (turn tracking, recovery, dedup tests)
  • Manual: kill gateway mid-turn, restart, verify orphan recovery resumes the turn
  • Manual: send duplicate message, verify dedup rejects it
  • Manual: abort a turn, verify outbox entries are cancelled

🤖 Generated with Claude Code

nohat and others added 16 commits February 28, 2026 09:34
Replace unbounded file-based delivery queue with queryable SQLite
message_outbox table. Adds TTL/expiry for stale entries, delivery
outcome retention, and one-time legacy file queue import on startup.

Closes openclaw#23777, openclaw#16555, openclaw#29128
…pat layer

Write-ahead delivery pattern: enqueue outbox entry before sending, ack on
success, retry on failure. Continuous outbox worker replaces one-shot
recovery. Plugin channels get durable delivery guarantees via v1/v2
adapter compat layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l and separate ackDelivery errors in recovery
Every inbound message creates a durable turn record in message_turns.
Turn worker detects orphaned turns (accepted but never completed after
crash) and recovers them. Abort commands mark turns as aborted,
preventing re-delivery. Outbox entries are linked to turns for
coordinated finalization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openclaw-barnacle bot added the labels channel: bluebubbles, channel: discord, channel: googlechat, channel: imessage, channel: line, channel: matrix, channel: mattermost, channel: msteams, channel: nextcloud-talk, channel: nostr, channel: signal, channel: slack, channel: telegram on Feb 28, 2026
@openclaw-barnacle
Closing this PR because it looks dirty (too many unrelated or unexpected changes). This usually happens when a branch picks up unrelated commits or a merge went sideways. Please recreate the PR from a clean branch.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5e4c0de5b

Comment on lines +606 to +609
SET status='failed_terminal',
error_class='terminal',
terminal_reason='non_final_recovery_skip',
completed_at=?

[P1] Avoid terminalizing skipped non-final recovery entries

recoverPendingDeliveries marks queued tool/block rows as failed_terminal, but the turn worker later treats any failed outbox row as a hard turn failure (outbox.failed > 0 in runTurnPass) instead of replaying the turn. In a crash where a non-final row is persisted but the final reply was never sent, this causes the whole turn to be finalized as failed and the user never gets the final response.
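
One way to address this concern is to have the turn worker tell recovery-skipped rows apart from genuine delivery failures. The sketch below is only an illustration under stated assumptions: the status values `queued`, `failed_retryable`, and `failed_terminal` and the reason `non_final_recovery_skip` come from the diff, while the `delivered` status, the `isFinal` flag, and the `decideTurn` function are hypothetical.

```typescript
// Hypothetical shape of an outbox row as seen by the turn worker.
interface OutboxRow {
  status: "queued" | "delivered" | "failed_retryable" | "failed_terminal";
  terminalReason?: string;
  isFinal: boolean; // assumed flag: true for the final reply, false for tool/block sends
}

type TurnDecision = "finalize_delivered" | "finalize_failed" | "replay" | "wait";

const SKIP_REASON = "non_final_recovery_skip";

function decideTurn(rows: OutboxRow[]): TurnDecision {
  // Rows terminalized only because recovery skipped them are not real failures.
  const hardFailures = rows.filter(
    (r) => r.status === "failed_terminal" && r.terminalReason !== SKIP_REASON,
  );
  if (hardFailures.length > 0) return "finalize_failed";

  // Anything still queued or retryable may yet be delivered, so hold off.
  const pending = rows.some(
    (r) => r.status === "queued" || r.status === "failed_retryable",
  );
  if (pending) return "wait";

  // Without a delivered final reply, replay the turn so the user gets a response.
  const finalDelivered = rows.some((r) => r.isFinal && r.status === "delivered");
  return finalDelivered ? "finalize_delivered" : "replay";
}
```

With this rule, a crash that leaves only a skipped non-final row behind leads to a replay instead of a failed turn.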

FROM message_outbox
WHERE status IN ('queued', 'failed_retryable')
AND next_attempt_at <= ?
AND (queued_at < ? OR last_attempt_at IS NOT NULL OR attempt_count > 0)

[P2] Recover fresh queued rows when direct-path ack bookkeeping fails

The startup-cutoff filter excludes rows enqueued after startup unless they already have retry metadata, but ackDelivery logs and swallows DB update errors; if that update fails, a successfully sent row can remain queued with attempt_count=0 and last_attempt_at=NULL forever. In that state this predicate keeps skipping the row on every worker pass, so the outbox/turn state never converges until stale-turn cleanup.
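
The predicate and one possible mitigation can be expressed as a pure function. This is a sketch, not the PR's code: the first three conditions mirror the SQL above, while the `maxFreshAgeMs` fallback, the `OutboxEntry` shape, and the `isRecoverable` name are hypothetical. Note that the fallback would re-send a row that was in fact delivered, so it only makes sense where duplicate sends are tolerable.

```typescript
// Hypothetical in-memory view of a message_outbox row (timestamps in epoch ms).
interface OutboxEntry {
  status: string;
  nextAttemptAt: number;
  queuedAt: number;
  lastAttemptAt: number | null;
  attemptCount: number;
}

function isRecoverable(
  e: OutboxEntry,
  now: number,
  startupCutoff: number,
  maxFreshAgeMs: number,
): boolean {
  // Mirrors: status IN ('queued', 'failed_retryable') AND next_attempt_at <= ?
  if (e.status !== "queued" && e.status !== "failed_retryable") return false;
  if (e.nextAttemptAt > now) return false;

  // Mirrors: queued_at < ? OR last_attempt_at IS NOT NULL OR attempt_count > 0
  const predatesStartup = e.queuedAt < startupCutoff;
  const hasRetryMetadata = e.lastAttemptAt !== null || e.attemptCount > 0;

  // Hypothetical addition: a fresh row with no metadata that has sat in the
  // queue too long is treated as stuck rather than skipped forever.
  const stuckFresh = now - e.queuedAt > maxFreshAgeMs;

  return predatesStartup || hasRetryMetadata || stuckFresh;
}
```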

greptile-apps bot (Contributor) commented Feb 28, 2026

Greptile Summary

This PR adds comprehensive durable turn tracking and orphan recovery infrastructure to prevent message loss after gateway crashes. The implementation migrates from file-based to SQLite-backed queuing and introduces coordinated turn lifecycle management.

Key changes:

  • Added message_turns and message_outbox SQLite tables for durable state tracking
  • Implemented active-turn registry (in-memory Set) to prevent recovery races between live dispatch and recovery worker
  • Added abort coordination that marks turns as aborted AND cancels pending outbox entries
  • Created recovery workers for both turn replay and outbox message delivery
  • Integrated turn tracking into dispatch flow with proper registration/unregistration
  • Added fail-open logic gated on actual delivery stats (not just queue acceptance)
  • Implemented startup cutoff to prevent double-delivery of messages enqueued during current instance lifetime
  • Migrated all channel extensions to support sendFinal adapter method
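
The abort coordination item above (mark the turn aborted and cancel its pending outbox entries together) can be sketched as follows. The types, the `cancelled` status, and the `abortTurn` function are hypothetical stand-ins for the SQLite-backed code, which would presumably perform both updates inside one transaction.

```typescript
// Hypothetical in-memory stand-ins for rows in message_turns and message_outbox.
interface TurnRow {
  id: string;
  status: "accepted" | "completed" | "aborted";
}

interface OutboxEntryRef {
  id: string;
  turnId: string;
  status: "queued" | "delivered" | "cancelled";
}

// Marks the turn aborted and cancels its still-queued outbox entries.
// Returns the number of entries cancelled.
function abortTurn(turn: TurnRow, outbox: OutboxEntryRef[]): number {
  // A turn that already finished (or was already aborted) is left alone.
  if (turn.status !== "accepted") return 0;
  turn.status = "aborted";

  let cancelled = 0;
  for (const entry of outbox) {
    // Entries that were already delivered cannot be recalled; only queued ones change.
    if (entry.turnId === turn.id && entry.status === "queued") {
      entry.status = "cancelled";
      cancelled += 1;
    }
  }
  return cancelled;
}
```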

Design strengths:

  • Turn finalization uses outbox status as source of truth, avoiding premature "delivered" status
  • Recovery worker skips non-final payloads (tool/block sends) since turn replay regenerates them
  • Persistent dedupe intentionally deferred to existing inbound dedupe path (documented)
  • Proper transaction handling with runLifecycleTransaction wrapper
  • Comprehensive error handling with fallback to in-memory dedupe cache

Test coverage: Tests updated for new dispatch behavior, outbox integration, and delivery stats tracking

Confidence Score: 4/5

  • Safe to merge with minor considerations around race conditions in edge cases
  • The implementation is well-architected, with proper error handling, transaction management, and race condition prevention. The active-turn registry prevents most recovery races, and the startup cutoff prevents double-delivery. Turn finalization uses outbox status as the source of truth. One point is docked for: (1) a small race window between the isTurnActive() check and the registerActiveTurn() call during recovery, though the worst case is duplicate sends, which are acceptable; and (2) persistent dedupe being disabled for now, pending normalization of per-channel message identity semantics.
  • Pay close attention to src/gateway/server-message-lifecycle.ts (recovery worker coordination), src/auto-reply/dispatch.ts (turn lifecycle management), and src/infra/message-lifecycle/turns.ts (core state tracking)
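
The race window mentioned above, between the isTurnActive() check and the registerActiveTurn() call, could in principle be closed by merging the two steps into a single synchronous claim. The sketch below assumes the registry is an in-memory Set, as the summary describes; the `ActiveTurnRegistry` class and its `tryClaim` and `release` methods are hypothetical names, and whether the real code awaits between the check and the registration is not shown in this conversation.

```typescript
// Hypothetical registry that combines "check" and "register" in one step.
class ActiveTurnRegistry {
  private active = new Set<string>();

  isActive(turnId: string): boolean {
    return this.active.has(turnId);
  }

  // Returns true if the caller now owns the turn, false if someone else does.
  // The method contains no await, so on a single JavaScript event loop no
  // other task can run between the check and the add.
  tryClaim(turnId: string): boolean {
    if (this.active.has(turnId)) return false;
    this.active.add(turnId);
    return true;
  }

  release(turnId: string): void {
    this.active.delete(turnId);
  }
}
```

A recovery worker would call `tryClaim` and skip the turn when it returns false, then call `release` in a `finally` block once replay ends.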

Last reviewed commit: e5e4c0d

Labels

channel: bluebubbles, channel: discord, channel: feishu, channel: googlechat, channel: imessage, channel: irc, channel: line, channel: matrix, channel: mattermost, channel: msteams, channel: nextcloud-talk, channel: nostr, channel: signal, channel: slack, channel: telegram, channel: tlon, channel: twitch, channel: whatsapp-web, channel: zalo, channel: zalouser, commands, gateway, size: XL

Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram inbound message can be re-queued on model fallback/rate-limit, causing duplicate user turns and missing outbound delivery