Skip to content

fix(telegram): sanitize update offset + lock polling#3186

Closed
daxiong888 wants to merge 1 commit intoopenclaw:mainfrom
daxiong888:pr/telegram-offset-lock
Closed

fix(telegram): sanitize update offset + lock polling#3186
daxiong888 wants to merge 1 commit intoopenclaw:mainfrom
daxiong888:pr/telegram-offset-lock

Conversation

@daxiong888
Copy link

@daxiong888 daxiong888 commented Jan 28, 2026

This prevents Telegram polling from silently getting stuck when persisted state is corrupted and avoids multi-poller conflicts.

Changes:

  • Validate persisted lastUpdateId (must be a safe integer within the 32-bit Bot API range). If invalid, back up the file and start from a null offset.
  • Add a per-bot polling lock (account + token hash) to prevent multiple concurrent getUpdates loops.
  • Wire the lock + offset sanitize into the polling monitor.

Notes:

  • Cross-platform: no systemd dependencies.
  • Escape hatch: set CLAWDBOT_ALLOW_MULTI_TELEGRAM_POLL=1 to disable the poll lock.

Tests:

  • pnpm vitest run src/telegram/update-offset-store.test.ts src/telegram/poll-lock.test.ts

Greptile Overview

Greptile Summary

This PR hardens Telegram long-polling by (1) validating and backing up corrupted persisted lastUpdateId state before starting, and (2) adding a per-bot polling lock (account + token hash) to prevent multiple concurrent getUpdates loops. The monitor now acquires the lock before starting polling, logs invalid offset recovery via onInvalid, and ensures the lock is released on exit (including webhook mode).

Confidence Score: 4/5

  • This PR is generally safe to merge, with low-to-moderate risk around edge-case lock reclamation and test flakiness.
  • Core behavior changes (offset sanitization + file backup, single-poller lock acquisition/release) are straightforward and covered by targeted unit tests. Main remaining concerns are an edge case where lock owner detection can misclassify a reused PID as alive on Linux when startTime isn’t available, and the new timing-sensitive test which may be flaky under CI load.
  • src/telegram/poll-lock.ts, src/telegram/poll-lock.test.ts

(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!

- Ignore/backup invalid persisted lastUpdateId instead of getting stuck\n- Add a per-bot poll lock to avoid multiple getUpdates loops\n- Wire lock + offset sanitize into telegram monitor
@openclaw-barnacle openclaw-barnacle bot added the channel: telegram Channel integration: telegram label Jan 28, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +94 to +97
const args = readLinuxCmdline(pid);
if (!args) return "unknown";
// Best-effort: still running, but not necessarily a poller; treat unknown as alive unless stale.
return args.length > 0 ? "alive" : "unknown";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] On Linux, the fallback cmdline check treats any process with a non-empty /proc/<pid>/cmdline as "alive" (args.length > 0). If payload.startTime is missing/corrupted or /proc/<pid>/stat is unreadable, a reused PID from an unrelated process can keep the lock looking alive and prevent reclaim until the acquire loop times out.

Also appears in: stale cleanup only running when ownerStatus !== "alive" at src/telegram/poll-lock.ts:215-233.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/telegram/poll-lock.ts
Line: 94:97

Comment:
[P2] On Linux, the fallback `cmdline` check treats any process with a non-empty `/proc/<pid>/cmdline` as "alive" (`args.length > 0`). If `payload.startTime` is missing/corrupted or `/proc/<pid>/stat` is unreadable, a reused PID from an unrelated process can keep the lock looking alive and prevent reclaim until the acquire loop times out.

Also appears in: stale cleanup only running when `ownerStatus !== "alive"` at src/telegram/poll-lock.ts:215-233.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +23 to +41
it("prevents multiple concurrent pollers for same account+token", async () => {
await withTempStateDir(async () => {
const lock1 = await acquireTelegramPollLock({
token: "123:abc",
accountId: "default",
timeoutMs: 50,
pollIntervalMs: 5,
staleMs: 10_000,
});

await expect(
acquireTelegramPollLock({
token: "123:abc",
accountId: "default",
timeoutMs: 30,
pollIntervalMs: 5,
staleMs: 10_000,
}),
).rejects.toBeInstanceOf(TelegramPollLockError);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P3] This test uses very tight timing (timeoutMs: 30–50, pollIntervalMs: 5). On slower/busy CI runners or some filesystems (notably Windows), this can become flaky (e.g., second acquire not observing the lock in time, or timing out differently). Consider loosening the timeouts / using fake timers to make the assertion deterministic.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/telegram/poll-lock.test.ts
Line: 23:41

Comment:
[P3] This test uses very tight timing (`timeoutMs: 30–50`, `pollIntervalMs: 5`). On slower/busy CI runners or some filesystems (notably Windows), this can become flaky (e.g., second acquire not observing the lock in time, or timing out differently). Consider loosening the timeouts / using fake timers to make the assertion deterministic.

How can I resolve this? If you propose a fix, please make it concise.

@vincentkoc
Copy link
Member

This is still directionally useful, but it now overlaps with later merged offset work and current monitor changes.

To make this mergeable, please rebase and split into two pieces:

  1. offset sanitization/validation only,
  2. poll-lock behavior only.

That split will let us land low-risk pieces first and preserve author credit cleanly.

@vincentkoc
Copy link
Member

Additional code-level blockers after deeper review:

  1. Environment variable naming is stale (CLAWDBOT_ALLOW_MULTI_TELEGRAM_POLL, CLAWDBOT_STATE_DIR in tests) while current repo conventions use OPENCLAW_* paths/config. This makes lock behavior brittle and hard to reason about in current deployments.

  2. This PR now overlaps strongly with merged polling-offset work (fix(telegram): prevent update offset skipping queued updates #23284) and with newer token-scoping work under review (fix: scope Telegram update offset to bot token #11347), so it needs to be split and rebased before it can be evaluated safely.

Recommendation: close this as-is and reopen as two focused PRs (lock-only, sanitize-only) rebased on current main.

@vincentkoc
Copy link
Member

Appreciate the effort on this hardening pass.

I’m closing this as not mergeable in current form: it overlaps with newer merged/active offset lineage and uses stale CLAWDBOT_* conventions that no longer match current repo/runtime patterns.

If you want to continue this work, please open focused follow-up PRs (lock-only and sanitize-only) rebased on current main.

@vincentkoc vincentkoc closed this Feb 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: telegram Channel integration: telegram

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants