fix(telegram): recover stale ingress claims after restart by zerone0x · Pull Request #84690 · openclaw/openclaw

zerone0x · 2026-05-20T20:07:29Z

Summary

Problem: Telegram isolated ingress could keep a fresh-looking .json.processing claim from a previous gateway identity when the PID was reused after recreate, blocking later same-lane updates.
Solution: Treat claims whose stored processPid equals the current process but whose unique processId differs as recoverable stale claims, not another live owner.
What changed: Updated Telegram spool ownership detection and regression tests for startup/drain recovery of restart-stale claims.
What did NOT change (scope boundary): No changes to Telegram delivery, handler timeout tombstoning, or cross-process live-claim protection for different live PIDs.

Motivation

Fixes a released Docker failure mode where a stale Telegram .processing file can survive gateway recreate and block later updates until manually moved aside.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes Telegram isolated ingress spool can remain blocked by stale .processing claim after gateway recreate #84674
Related #
This PR fixes a bug or regression

Real behavior proof (required for external PRs)

Behavior or issue addressed: stale Telegram isolated-ingress .json.processing claim from an older gateway identity with the current PID no longer blocks later queued updates.
Real environment tested: local Linux checkout, Node v22.19.0, Telegram plugin spool/session tests.
Exact steps or command run after this patch: node scripts/run-vitest.mjs extensions/telegram/src/telegram-ingress-spool.test.ts extensions/telegram/src/polling-session.test.ts

Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):

✓  extension-telegram  ../../extensions/telegram/src/polling-session.test.ts (42 tests) 9850ms
✓  extension-telegram  ../../extensions/telegram/src/telegram-ingress-spool.test.ts (7 tests) 14ms

Test Files  2 passed (2)
     Tests  49 passed (49)

Observed result after fix: restart-stale current-PID/different-process-id claims are recovered back to pending updates, and focused Telegram spool/session tests pass.
What was not tested: live Docker/Telegram replay against a real bot.
Before evidence (optional but encouraged): existing regression test preserved the .processing file and left only the later update pending.

Root Cause (if applicable)

Root cause: live-claim detection considered any fresh claim with an existing PID to be owned by another live process, even when that PID was the current gateway process and the unique process id differed from the current gateway identity.
Missing detection / guardrail: no regression test covered gateway recreate / PID reuse where processPid === process.pid but processId belongs to an older gateway instance.
Contributing context (if known): isolated ingress drain uses this live-owner predicate before deciding whether to recover stale claims.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: extensions/telegram/src/telegram-ingress-spool.test.ts, extensions/telegram/src/polling-session.test.ts
Scenario the test should lock in: a .json.processing claim with the current PID but an older/different process identity is recoverable and returns the blocked update to the pending spool.
Why this is the smallest reliable guardrail: it exercises the exact plugin-local ownership predicate and drain recovery gate without needing live Telegram.
Existing test that already covers this (if any): N/A; existing coverage asserted the old blocking behavior.
If no new test is added, why not: N/A

User-visible / Behavior Changes

Telegram isolated polling ingress can recover stale .processing claims left behind by a recreated gateway when the PID is reused, preventing later updates from staying blocked indefinitely.

Diagram (if applicable)

Before:
old gateway claim (.processing, pid=current, processId=old) -> treated live-owned -> later .json updates stay blocked

After:
old gateway claim (.processing, pid=current, processId=old) -> recovered to .json -> drain can process queued updates

Security Impact (required)

New permissions/capabilities? (No)
Secrets/tokens handling changed? (No)
New/changed network calls? (No)
Command/tool execution surface changed? (No)
Data access scope changed? (No)
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: Linux kvm16007 6.1.0-44-amd64 x86_64
Runtime/container: local source checkout, Node v22.19.0
Model/provider: N/A
Integration/channel (if any): Telegram isolated ingress spool
Relevant config (redacted): N/A

Steps

Create a Telegram spool with update 42 claimed as .json.processing using processPid: process.pid and a different processId.
Queue update 43 on the same lane.
Run stale-claim recovery with the same live-owner gate used by isolated ingress drain.

Expected

Update 42 is recovered back to .json; updates 42 and 43 are pending in order.

Actual

After this patch, the focused regression test observes updates 42 and 43 pending and no lingering .json.processing file.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: focused Telegram spool/session regression tests pass; git diff --check passes.
Edge cases checked: existing stale-timeout recovery and handler/session tests still pass in the same focused run.
What you did not verify: live Docker/Telegram bot replay.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (No)
Migration needed? (No)
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: accidentally replaying work still owned by a live second gateway.
- Mitigation: the predicate only treats the current PID plus different process identity as restart-stale; different fresh live PIDs are still protected by the existing process-existence check.

clawsweeper · 2026-05-20T20:09:09Z

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR updates Telegram ingress spool ownership detection so current-PID claims from older process identities can be recovered, and adjusts focused spool/session regression tests.

Reproducibility: yes. source inspection gives a high-confidence reproduction path: current main's drain skips recovery when a fresh different-processId claim has the current PID, matching the linked stale processing-file report. I did not run the tests because this review is read-only.

PR rating
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Summary: The patch is small and source-aligned, but external real behavior proof is still missing for a delivery-sensitive Telegram fix.

Rank-up moves:

Add redacted live Docker/Telegram restart-spool proof, terminal output, logs, or a recording showing stale claim recovery and later message delivery.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body includes copied focused Vitest output only; it still needs redacted live Docker/Telegram replay evidence or equivalent runtime logs for the restart-stale claim path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Mantis proof suggestion
A live Telegram lane would materially prove that the recovery predicate fixes the delivery-blocking restart path outside focused tests. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

telegram live: verify isolated polling recovers a current-PID/different-process-id .json.processing claim after gateway restart and delivers the later queued message.

Risk before merge

The PR affects Telegram message delivery recovery, and merging without a live Docker/Telegram replay leaves the reported gateway-recreate failure mode proven only by focused tests.

Maintainer options:

Require live restart-spool proof (recommended)
Before merge, ask for a redacted Docker/Telegram or equivalent live ingress replay showing a current-PID/different-processId processing claim is recovered and the later queued update is delivered.
Accept focused seam proof
Maintainers could intentionally accept the unit/seam coverage if they decide this spool predicate is isolated enough for the released failure mode.

Next step before merge
The remaining blocker is real-behavior proof or maintainer acceptance for an external Telegram delivery fix; there is no narrow code defect for an automated repair worker.

Security
Cleared: The diff only changes Telegram spool ownership logic and tests; it does not add dependencies, permissions, secret handling, network calls, or code execution surfaces.

Review details

Best possible solution:

Land this narrow Telegram plugin predicate change after redacted real restart-spool proof, while preserving the existing different-PID live-claim protection.

Do we have a high-confidence way to reproduce the issue?

Yes, source inspection gives a high-confidence reproduction path: current main's drain skips recovery when a fresh different-processId claim has the current PID, matching the linked stale processing-file report. I did not run the tests because this review is read-only.

Is this the best way to solve the issue?

Yes, the code change is the narrowest maintainable fix because the ownership decision belongs in the Telegram spool predicate rather than core gateway policy. The remaining gap is runtime proof, not a different implementation direction.

Label changes:

add P1: The PR targets a released Telegram workflow where stale ingress state can block later messages and stop replies for real users.
add merge-risk: 🚨 message-delivery: The diff changes when Telegram processing claims are requeued, so a mistake could affect delivery or replay behavior beyond what focused tests prove.
add rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is small and source-aligned, but external real behavior proof is still missing for a delivery-sensitive Telegram fix.
add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body includes copied focused Vitest output only; it still needs redacted live Docker/Telegram replay evidence or equivalent runtime logs for the restart-stale claim path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

P1: The PR targets a released Telegram workflow where stale ingress state can block later messages and stop replies for real users.
merge-risk: 🚨 message-delivery: The diff changes when Telegram processing claims are requeued, so a mistake could affect delivery or replay behavior beyond what focused tests prove.
rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is small and source-aligned, but external real behavior proof is still missing for a delivery-sensitive Telegram fix.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body includes copied focused Vitest output only; it still needs redacted live Docker/Telegram replay evidence or equivalent runtime logs for the restart-stale claim path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

Current predicate blocks reused current PID: Current main treats any fresh claim with a different processId and an existing PID as owned by another live process, so a reused current PID from an older gateway identity is protected rather than recovered. (extensions/telegram/src/telegram-ingress-spool.ts:178, 1a7669bc63a0)
Drain gate uses that predicate before claiming pending updates: The isolated ingress drain calls stale-claim recovery with staleMs 0, then skips recovery when active lane keys or the live-owner predicate say the processing file is still owned. (extensions/telegram/src/polling-session.ts:503, 1a7669bc63a0)
Current test documents the old blocking behavior: The current polling-session test writes a processing claim with processPid equal to process.pid and a different processId, then expects no recovery and only the later update pending. (extensions/telegram/src/polling-session.test.ts:894, 1a7669bc63a0)
PR patch is focused on the implicated predicate: The provided PR diff adds the current-PID exclusion to the live-owner predicate and updates the two focused regression expectations around restart-stale claims. (extensions/telegram/src/telegram-ingress-spool.ts:181, b1e2bd62b564)
Proof is test-only: The PR body reports copied output from focused Telegram spool/session Vitest files and explicitly says live Docker/Telegram replay against a real bot was not tested. (b1e2bd62b564)
Telegram review context requires stronger runtime proof: Matching maintainer notes call for real Telegram proof for behavior PRs touching Telegram transport or delivery-sensitive paths, so the current synthetic-only evidence is not enough for an external PR merge gate. (.agents/maintainer-notes/telegram.md:37, 1a7669bc63a0)

Likely related people:

obviyus: Ayaan Zaidi introduced the Telegram ingress spool files and the live-owner predicate in commit 357e3ec. (role: introduced behavior; confidence: high; commits: 357e3ecc65e3; files: extensions/telegram/src/telegram-ingress-spool.ts, extensions/telegram/src/telegram-ingress-spool.test.ts, extensions/telegram/src/polling-session.ts)
steipete: Peter Steinberger has the heaviest recent history on the Telegram polling/session files and release-adjacent Telegram refactors around this path. (role: recent area contributor; confidence: medium; commits: 50a2481652b6, 0c0f1e34cba0, 53f90af990a6; files: extensions/telegram/src/polling-session.ts, extensions/telegram/src/polling-session.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a7669bc63a0.

clawsweeper · 2026-05-20T20:14:38Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

martingarramon

LGTM-with-concerns. The core fix is correct — adding `claim.claim.processPid !== process.pid` short-circuits the live-owner check for restart-stale claims that recycled the current gateway PID, and the `polling-session.test.ts` change from `claimedAt: now - STALE_MS - 1` to `claimedAt: now` is important: the old test was passing because `isFreshClaimOwner` returned false (stale path), not because it was exercising the PID-reuse case. The new test covers the actual bug path.

Two actionable concerns:

1. Heading format — blocks CI gate.
`## Real behavior proof (required for external PRs)` should be `## Real behavior proof` (exact heading, no suffix). The CI gate matches the exact string; the `(required for external PRs)` suffix is causing the `Real behavior proof` check failure. Content is complete and correct — only the heading needs trimming.

2. Missing symmetric live-owner test.
The `polling-session.test.ts` change converts the existing blocked-case test into a recovery-case test. That's correct, but there's now no test asserting the symmetric path: "different PID + different processId + fresh + processExists → claim remains blocked." The old test was inadvertently covering this (via the stale path). A focused test for the still-blocked case would close the gap.

The other four CI failures (`check-additional-extension-bundled`, `check-lint`, `check-test-types`, `checks-node-core-runtime-infra-state`) are main-churn unrelated to this diff.

fix(telegram): recover stale ingress claims after restart

b1e2bd6

openclaw-barnacle Bot added channel: telegram Channel integration: telegram size: XS triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 20, 2026

martingarramon reviewed May 20, 2026

View reviewed changes

clawsweeper Bot mentioned this pull request May 22, 2026

[Bug]: Telegram ingress spool — orphan claim from prior container blocks all inbound (processExists(1) false-positive in PID-namespaced runtimes) #85168

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(telegram): recover stale ingress claims after restart#84690

fix(telegram): recover stale ingress claims after restart#84690
zerone0x wants to merge 1 commit into
openclaw:mainfrom
zerone0x:fix/84674-telegram-spool-stale-claim

zerone0x commented May 20, 2026

Uh oh!

clawsweeper Bot commented May 20, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 20, 2026

Uh oh!

martingarramon left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

zerone0x commented May 20, 2026

Summary

Motivation

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Real behavior proof (required for external PRs)

Root Cause (if applicable)

Regression Test Plan (if applicable)

User-visible / Behavior Changes

Diagram (if applicable)

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Risks and Mitigations

Uh oh!

clawsweeper Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 20, 2026

Uh oh!

martingarramon left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clawsweeper Bot commented May 20, 2026 •

edited

Loading