Skip to content

fix: recover stale reply turns with valid send action#81202

Closed
moeedahmed wants to merge 1 commit into
openclaw:mainfrom
moeedahmed:fix/stale-reply-turn-recovery-schema
Closed

fix: recover stale reply turns with valid send action#81202
moeedahmed wants to merge 1 commit into
openclaw:mainfrom
moeedahmed:fix/stale-reply-turn-recovery-schema

Conversation

@moeedahmed

Copy link
Copy Markdown
Contributor

Summary

Fixes stale reply-turn recovery so Telegram interruption fallback messages use the current message.action schema.

Why

A stalled reply turn could be released internally but fail to notify Telegram because recovery called message.action with obsolete top-level target, threadId, and message fields. The gateway now requires nested params plus idempotencyKey, so the fallback was rejected and users saw silence.

Changes

  • Sends recovery fallback through message.action using nested params.to, params.threadId, and params.message.
  • Adds a deterministic idempotency key for stale reply-turn recovery.
  • Adds regression coverage for stale unfinished topic-turn closure and the outbound schema.

Real behavior proof

Behavior addressed: stale Telegram reply-turn recovery silently failed after a long run/tool timeout.

Real environment tested: Moeed Ahmed's local OpenClaw gateway on macOS, Telegram forum topic, installed runtime and clean upstream worktree.

Exact steps or command run after the patch: pnpm oxfmt --check src/logging/diagnostic-stuck-session-recovery.runtime.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts; pnpm vitest run src/logging/diagnostic-stuck-session-recovery.runtime.test.ts; node --check /opt/homebrew/lib/node_modules/openclaw/dist/diagnostic-stuck-session-recovery.runtime-CkodWvBA.js.

Evidence after fix: formatting passed; targeted Vitest passed 1 file / 12 tests; installed runtime syntax check passed; live gateway remained reachable and Telegram channel status was OK.

Observed result after fix: the recovery fallback now builds a valid message.action request with nested send params and an idempotency key, so stale reply-turn closure can notify Telegram instead of failing schema validation.

What was not tested: full upstream CI and live forced production stale-turn recovery after release packaging.

Tests

  • pnpm oxfmt --check src/logging/diagnostic-stuck-session-recovery.runtime.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts
  • pnpm vitest run src/logging/diagnostic-stuck-session-recovery.runtime.test.ts
  • node --check /opt/homebrew/lib/node_modules/openclaw/dist/diagnostic-stuck-session-recovery.runtime-CkodWvBA.js

@openclaw-barnacle openclaw-barnacle Bot added size: M triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 12, 2026
@clawsweeper

clawsweeper Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR adds stale reply-turn recovery to diagnostic stuck-session cleanup by scanning session stores, sending a fallback through message.action, marking the turn failed, and adding regression coverage.

Reproducibility: no. not for the reported live Telegram stale-turn silence: the PR describes the scenario, but no forced stale-turn Telegram recovery proof is provided. Source inspection does show a high-confidence patch defect because the proposed replyTurn* trigger fields are not present or written on current main.

Real behavior proof
Needs real behavior proof before merge: The PR lists formatting, Vitest, syntax checks, and a live status check, but does not show after-fix evidence that a forced stale Telegram reply-turn recovery delivered the fallback message. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Manual review is needed because this external draft has a blocking source-level defect and missing real Telegram behavior proof; automation should not repair it while the contributor proof gate is unmet.

Security
Cleared: No concrete security or supply-chain regression was found in the diff; the runtime change uses existing local session metadata and the existing gateway RPC path.

Review findings

  • [P1] Persist the reply-turn state before reading it — src/logging/diagnostic-stuck-session-recovery.runtime.ts:110-113
Review details

Best possible solution:

Base stale-turn recovery on a real persisted current-main turn-state contract, or add that typed contract and its writer before sending the fallback through the current message.action schema.

Do we have a high-confidence way to reproduce the issue?

No, not for the reported live Telegram stale-turn silence: the PR describes the scenario, but no forced stale-turn Telegram recovery proof is provided. Source inspection does show a high-confidence patch defect because the proposed replyTurn* trigger fields are not present or written on current main.

Is this the best way to solve the issue?

No. The nested message.action request is the right schema direction, but the PR must first connect recovery to persisted runtime state instead of hand-seeded test-only fields.

Full review comments:

  • [P1] Persist the reply-turn state before reading it — src/logging/diagnostic-stuck-session-recovery.runtime.ts:110-113
    The new recovery branch gates on entry.replyTurnState and replyTurnUpdatedAt, but current main has no SessionEntry fields or production writer for any replyTurn* key. This will fail strict typing, and even if forced through, real session stores will never satisfy the branch; the regression test only passes because it hand-writes those fields. Add the session-store contract and writer, or recover from existing persisted state.
    Confidence: 0.93

Overall correctness: patch is incorrect
Overall confidence: 0.9

What I checked:

  • Current recovery behavior: Current main's stuck-session recovery checks active work, lane activity, and resets the command lane; it has no outbound fallback delivery path in this file. (src/logging/diagnostic-stuck-session-recovery.runtime.ts:165, 8998e536fa0e)
  • Gateway send contract: message.action currently requires top-level channel, action, nested params, and top-level idempotencyKey, matching the PR's intended send shape. (src/gateway/protocol/schema/agent.ts:68, 8998e536fa0e)
  • Missing session-store contract: The current SessionEntry routing fields include deliveryContext, lastTo, lastAccountId, and lastThreadId; there are no typed replyTurnState, replyTurnStartedAt, replyTurnUpdatedAt, or replyTurnLastError fields. (src/config/sessions/types.ts:356, 8998e536fa0e)
  • No production writer found: A source search over current main found no production or docs references for the proposed replyTurn* fields, so the new recovery condition is not connected to persisted runtime state. (8998e536fa0e)
  • PR diff issue: The PR head gates recovery on entry.replyTurnState and timestamps that are not part of current-main SessionEntry, while its regression test hand-seeds those fields directly. (src/logging/diagnostic-stuck-session-recovery.runtime.ts:110, 0e4b0def6e90)
  • Telegram proof standard: The Telegram maintainer note says Telegram behavior PRs touching topics or reply context need real Telegram proof; this PR says forced production stale-turn recovery was not tested. (.agents/maintainer-notes/telegram.md:37, 8998e536fa0e)

Likely related people:

  • Peter Steinberger: Visible current-main git log and git blame in this checkout show recent work across the diagnostic recovery file and gateway schema surfaces touched by this PR. (role: recent visible area contributor; confidence: low; commits: 9531befca524; files: src/logging/diagnostic-stuck-session-recovery.runtime.ts, src/gateway/protocol/schema/agent.ts)

Remaining risk / open question:

  • The reported live Telegram stale-turn silence was not independently reproduced in this read-only review.
  • The PR body reports only mocked/targeted checks for the changed behavior and explicitly does not prove a forced production stale-turn recovery.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 8998e536fa0e.

@clawsweeper clawsweeper Bot added the mantis: telegram-visible-proof Mantis should capture Telegram visible proof. label May 12, 2026
@moeedahmed

Copy link
Copy Markdown
Contributor Author

Thanks — acknowledged. This PR currently has schema/runtime validation proof, but not the forced live Telegram stale-turn recovery proof requested here.

Current proof available:

pnpm oxfmt --check src/logging/diagnostic-stuck-session-recovery.runtime.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts
pnpm vitest run src/logging/diagnostic-stuck-session-recovery.runtime.test.ts
# 1 file passed, 12 tests passed
node --check <installed diagnostic-stuck-session-recovery runtime chunk>

The proof shows the fallback now builds a valid message.action request with nested send params and an idempotency key. I am not claiming that a forced live Telegram stale-turn recovery has been recorded yet.

I will add redacted live recovery evidence before treating this PR as merge-ready; for now, please keep the proof blocker open rather than assuming the contributor missed the request.

@moeedahmed

Copy link
Copy Markdown
Contributor Author

Closing this draft to reduce duplicate/open PR load.

Reason: the branch is conflicting and ClawSweeper identified a blocking contract issue: it depends on replyTurn* fields that current main does not type or persist, so the recovery branch would not run against real session stores. The proof is mock-only/hand-seeded rather than real Telegram stale-turn recovery.

The useful learning has been carried forward as context on #86089. If this behaviour is still needed, it should be rebuilt against the current persisted session contract with live Telegram recovery proof.

@moeedahmed moeedahmed closed this May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mantis: telegram-visible-proof Mantis should capture Telegram visible proof. size: M triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant