Skip to content

fix(telegram): recover stale ingress claims after restart#84690

Open
zerone0x wants to merge 1 commit into
openclaw:mainfrom
zerone0x:fix/84674-telegram-spool-stale-claim
Open

fix(telegram): recover stale ingress claims after restart#84690
zerone0x wants to merge 1 commit into
openclaw:mainfrom
zerone0x:fix/84674-telegram-spool-stale-claim

Conversation

@zerone0x

Copy link
Copy Markdown
Contributor

Summary

  • Problem: Telegram isolated ingress could keep a fresh-looking .json.processing claim from a previous gateway identity when the PID was reused after recreate, blocking later same-lane updates.
  • Solution: Treat claims whose stored processPid equals the current process but whose unique processId differs as recoverable stale claims, not another live owner.
  • What changed: Updated Telegram spool ownership detection and regression tests for startup/drain recovery of restart-stale claims.
  • What did NOT change (scope boundary): No changes to Telegram delivery, handler timeout tombstoning, or cross-process live-claim protection for different live PIDs.

Motivation

  • Fixes a released Docker failure mode where a stale Telegram .processing file can survive gateway recreate and block later updates until manually moved aside.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: stale Telegram isolated-ingress .json.processing claim from an older gateway identity with the current PID no longer blocks later queued updates.
  • Real environment tested: local Linux checkout, Node v22.19.0, Telegram plugin spool/session tests.
  • Exact steps or command run after this patch: node scripts/run-vitest.mjs extensions/telegram/src/telegram-ingress-spool.test.ts extensions/telegram/src/polling-session.test.ts
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):
    ✓  extension-telegram  ../../extensions/telegram/src/polling-session.test.ts (42 tests) 9850ms
    ✓  extension-telegram  ../../extensions/telegram/src/telegram-ingress-spool.test.ts (7 tests) 14ms
    
    Test Files  2 passed (2)
         Tests  49 passed (49)
    
  • Observed result after fix: restart-stale current-PID/different-process-id claims are recovered back to pending updates, and focused Telegram spool/session tests pass.
  • What was not tested: live Docker/Telegram replay against a real bot.
  • Before evidence (optional but encouraged): existing regression test preserved the .processing file and left only the later update pending.

Root Cause (if applicable)

  • Root cause: live-claim detection considered any fresh claim with an existing PID to be owned by another live process, even when that PID was the current gateway process and the unique process id differed from the current gateway identity.
  • Missing detection / guardrail: no regression test covered gateway recreate / PID reuse where processPid === process.pid but processId belongs to an older gateway instance.
  • Contributing context (if known): isolated ingress drain uses this live-owner predicate before deciding whether to recover stale claims.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/telegram/src/telegram-ingress-spool.test.ts, extensions/telegram/src/polling-session.test.ts
  • Scenario the test should lock in: a .json.processing claim with the current PID but an older/different process identity is recoverable and returns the blocked update to the pending spool.
  • Why this is the smallest reliable guardrail: it exercises the exact plugin-local ownership predicate and drain recovery gate without needing live Telegram.
  • Existing test that already covers this (if any): N/A; existing coverage asserted the old blocking behavior.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Telegram isolated polling ingress can recover stale .processing claims left behind by a recreated gateway when the PID is reused, preventing later updates from staying blocked indefinitely.

Diagram (if applicable)

Before:
old gateway claim (.processing, pid=current, processId=old) -> treated live-owned -> later .json updates stay blocked

After:
old gateway claim (.processing, pid=current, processId=old) -> recovered to .json -> drain can process queued updates

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux kvm16007 6.1.0-44-amd64 x86_64
  • Runtime/container: local source checkout, Node v22.19.0
  • Model/provider: N/A
  • Integration/channel (if any): Telegram isolated ingress spool
  • Relevant config (redacted): N/A

Steps

  1. Create a Telegram spool with update 42 claimed as .json.processing using processPid: process.pid and a different processId.
  2. Queue update 43 on the same lane.
  3. Run stale-claim recovery with the same live-owner gate used by isolated ingress drain.

Expected

  • Update 42 is recovered back to .json; updates 42 and 43 are pending in order.

Actual

  • After this patch, the focused regression test observes updates 42 and 43 pending and no lingering .json.processing file.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: focused Telegram spool/session regression tests pass; git diff --check passes.
  • Edge cases checked: existing stale-timeout recovery and handler/session tests still pass in the same focused run.
  • What you did not verify: live Docker/Telegram bot replay.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: accidentally replaying work still owned by a live second gateway.
    • Mitigation: the predicate only treats the current PID plus different process identity as restart-stale; different fresh live PIDs are still protected by the existing process-existence check.

@openclaw-barnacle openclaw-barnacle Bot added channel: telegram Channel integration: telegram size: XS triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR updates Telegram ingress spool ownership detection so current-PID claims from older process identities can be recovered, and adjusts focused spool/session regression tests.

Reproducibility: yes. source inspection gives a high-confidence reproduction path: current main's drain skips recovery when a fresh different-processId claim has the current PID, matching the linked stale processing-file report. I did not run the tests because this review is read-only.

PR rating
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Summary: The patch is small and source-aligned, but external real behavior proof is still missing for a delivery-sensitive Telegram fix.

Rank-up moves:

  • Add redacted live Docker/Telegram restart-spool proof, terminal output, logs, or a recording showing stale claim recovery and later message delivery.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body includes copied focused Vitest output only; it still needs redacted live Docker/Telegram replay evidence or equivalent runtime logs for the restart-stale claim path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Mantis proof suggestion
A live Telegram lane would materially prove that the recovery predicate fixes the delivery-blocking restart path outside focused tests. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

telegram live: verify isolated polling recovers a current-PID/different-process-id .json.processing claim after gateway restart and delivers the later queued message.

Risk before merge

  • The PR affects Telegram message delivery recovery, and merging without a live Docker/Telegram replay leaves the reported gateway-recreate failure mode proven only by focused tests.

Maintainer options:

  1. Require live restart-spool proof (recommended)
    Before merge, ask for a redacted Docker/Telegram or equivalent live ingress replay showing a current-PID/different-processId processing claim is recovered and the later queued update is delivered.
  2. Accept focused seam proof
    Maintainers could intentionally accept the unit/seam coverage if they decide this spool predicate is isolated enough for the released failure mode.

Next step before merge
The remaining blocker is real-behavior proof or maintainer acceptance for an external Telegram delivery fix; there is no narrow code defect for an automated repair worker.

Security
Cleared: The diff only changes Telegram spool ownership logic and tests; it does not add dependencies, permissions, secret handling, network calls, or code execution surfaces.

Review details

Best possible solution:

Land this narrow Telegram plugin predicate change after redacted real restart-spool proof, while preserving the existing different-PID live-claim protection.

Do we have a high-confidence way to reproduce the issue?

Yes, source inspection gives a high-confidence reproduction path: current main's drain skips recovery when a fresh different-processId claim has the current PID, matching the linked stale processing-file report. I did not run the tests because this review is read-only.

Is this the best way to solve the issue?

Yes, the code change is the narrowest maintainable fix because the ownership decision belongs in the Telegram spool predicate rather than core gateway policy. The remaining gap is runtime proof, not a different implementation direction.

Label changes:

  • add P1: The PR targets a released Telegram workflow where stale ingress state can block later messages and stop replies for real users.
  • add merge-risk: 🚨 message-delivery: The diff changes when Telegram processing claims are requeued, so a mistake could affect delivery or replay behavior beyond what focused tests prove.
  • add rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is small and source-aligned, but external real behavior proof is still missing for a delivery-sensitive Telegram fix.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body includes copied focused Vitest output only; it still needs redacted live Docker/Telegram replay evidence or equivalent runtime logs for the restart-stale claim path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

  • P1: The PR targets a released Telegram workflow where stale ingress state can block later messages and stop replies for real users.
  • merge-risk: 🚨 message-delivery: The diff changes when Telegram processing claims are requeued, so a mistake could affect delivery or replay behavior beyond what focused tests prove.
  • rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is small and source-aligned, but external real behavior proof is still missing for a delivery-sensitive Telegram fix.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body includes copied focused Vitest output only; it still needs redacted live Docker/Telegram replay evidence or equivalent runtime logs for the restart-stale claim path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

  • Current predicate blocks reused current PID: Current main treats any fresh claim with a different processId and an existing PID as owned by another live process, so a reused current PID from an older gateway identity is protected rather than recovered. (extensions/telegram/src/telegram-ingress-spool.ts:178, 1a7669bc63a0)
  • Drain gate uses that predicate before claiming pending updates: The isolated ingress drain calls stale-claim recovery with staleMs 0, then skips recovery when active lane keys or the live-owner predicate say the processing file is still owned. (extensions/telegram/src/polling-session.ts:503, 1a7669bc63a0)
  • Current test documents the old blocking behavior: The current polling-session test writes a processing claim with processPid equal to process.pid and a different processId, then expects no recovery and only the later update pending. (extensions/telegram/src/polling-session.test.ts:894, 1a7669bc63a0)
  • PR patch is focused on the implicated predicate: The provided PR diff adds the current-PID exclusion to the live-owner predicate and updates the two focused regression expectations around restart-stale claims. (extensions/telegram/src/telegram-ingress-spool.ts:181, b1e2bd62b564)
  • Proof is test-only: The PR body reports copied output from focused Telegram spool/session Vitest files and explicitly says live Docker/Telegram replay against a real bot was not tested. (b1e2bd62b564)
  • Telegram review context requires stronger runtime proof: Matching maintainer notes call for real Telegram proof for behavior PRs touching Telegram transport or delivery-sensitive paths, so the current synthetic-only evidence is not enough for an external PR merge gate. (.agents/maintainer-notes/telegram.md:37, 1a7669bc63a0)

Likely related people:

  • obviyus: Ayaan Zaidi introduced the Telegram ingress spool files and the live-owner predicate in commit 357e3ec. (role: introduced behavior; confidence: high; commits: 357e3ecc65e3; files: extensions/telegram/src/telegram-ingress-spool.ts, extensions/telegram/src/telegram-ingress-spool.test.ts, extensions/telegram/src/polling-session.ts)
  • steipete: Peter Steinberger has the heaviest recent history on the Telegram polling/session files and release-adjacent Telegram refactors around this path. (role: recent area contributor; confidence: medium; commits: 50a2481652b6, 0c0f1e34cba0, 53f90af990a6; files: extensions/telegram/src/polling-session.ts, extensions/telegram/src/polling-session.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a7669bc63a0.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. labels May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@martingarramon martingarramon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM-with-concerns. The core fix is correct — adding `claim.claim.processPid !== process.pid` short-circuits the live-owner check for restart-stale claims that recycled the current gateway PID, and the `polling-session.test.ts` change from `claimedAt: now - STALE_MS - 1` to `claimedAt: now` is important: the old test was passing because `isFreshClaimOwner` returned false (stale path), not because it was exercising the PID-reuse case. The new test covers the actual bug path.

Two actionable concerns:

1. Heading format — blocks CI gate.
`## Real behavior proof (required for external PRs)` should be `## Real behavior proof` (exact heading, no suffix). The CI gate matches the exact string; the `(required for external PRs)` suffix is causing the `Real behavior proof` check failure. Content is complete and correct — only the heading needs trimming.

2. Missing symmetric live-owner test.
The `polling-session.test.ts` change converts the existing blocked-case test into a recovery-case test. That's correct, but there's now no test asserting the symmetric path: "different PID + different processId + fresh + processExists → claim remains blocked." The old test was inadvertently covering this (via the stale path). A focused test for the still-blocked case would close the gap.

The other four CI failures (`check-additional-extension-bundled`, `check-lint`, `check-test-types`, `checks-node-core-runtime-infra-state`) are main-churn unrelated to this diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: telegram Channel integration: telegram merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. size: XS status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram isolated ingress spool can remain blocked by stale .processing claim after gateway recreate

2 participants