fix(telegram): keep polling watchdog on getUpdates liveness #78646

Merged
steipete merged 1 commit into openclaw:main from ai-hpc:fix/telegram-polling-watchdog-getupdates
May 7, 2026

Conversation

ai-hpc (Contributor) commented May 6, 2026

Summary

  • Problem: Telegram polling stall recovery treated unrelated outbound Bot API activity as liveness for inbound getUpdates polling.
  • Why it matters: active sendMessage traffic could mask a wedged inbound polling loop, leaving Telegram replies silent until a manual restart.
  • What changed: make the stall watchdog depend on completed/stuck getUpdates liveness only, while keeping unrelated API elapsed time in diagnostics.
  • What did NOT change (scope boundary): this does not redesign Telegram transport rebuild behavior beyond ensuring the watchdog fires when inbound polling is stale.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: TelegramPollingLivenessTracker.detectStall() returned no stall when either getUpdates elapsed time or generic Bot API elapsed time was still within the threshold.
  • Missing detection / guardrail: tests covered stale polling and stale unrelated API calls, but not the case where stale getUpdates coincides with recent or in-flight non-polling API traffic.
  • Contributing context (if known): outbound Telegram API success proves the Bot API path is alive, but it does not prove inbound long-polling is still progressing.
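
The root cause can be illustrated with a minimal sketch. The names here (`LivenessSnapshot`, `isStalledBefore`, `isStalledAfter`) are hypothetical, and the strict `>` comparison is an assumption; the real `TelegramPollingLivenessTracker.detectStall()` may be structured differently. Only the gating behavior described above is taken from this PR.

```typescript
// Hypothetical snapshot of the two elapsed-time signals named in the root cause.
interface LivenessSnapshot {
  pollElapsedMs: number; // time since getUpdates last showed progress
  apiElapsedMs: number; // time since any Bot API call last showed progress
}

// Before (as described): a stall was reported only when BOTH signals were stale,
// so a fresh sendMessage could hide a wedged getUpdates loop.
function isStalledBefore(s: LivenessSnapshot, thresholdMs: number): boolean {
  return s.pollElapsedMs > thresholdMs && s.apiElapsedMs > thresholdMs;
}

// After (as described): only getUpdates liveness gates the watchdog.
// apiElapsedMs stays available for diagnostics but no longer affects the verdict.
function isStalledAfter(s: LivenessSnapshot, thresholdMs: number): boolean {
  return s.pollElapsedMs > thresholdMs;
}
```

With `pollElapsedMs = 135000`, `apiElapsedMs = 15000`, and a 120000 ms threshold, the first predicate reports no stall while the second reports one.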

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/telegram/src/polling-liveness.test.ts, extensions/telegram/src/polling-session.test.ts
  • Scenario the test should lock in: stale getUpdates still triggers watchdog restart even when sendMessage recently succeeded or a non-getUpdates API call is in flight.
  • Why this is the smallest reliable guardrail: the regression is in the polling liveness decision and session watchdog behavior, so targeted tracker/session tests catch it without live Telegram credentials.
  • Existing test that already covers this (if any): existing stale polling tests covered the baseline restart path but not unrelated API masking.
  • If no new test is added, why not: N/A
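
The scenario the tests lock in could look roughly like the following sketch. `SketchLivenessTracker` and its methods are hypothetical stand-ins written for illustration, not the API of the real tracker or of the test files listed above; the only detail borrowed from the PR discussion is that no stall is signalled by `null`.

```typescript
// Hypothetical stand-in for the tracker; timestamps are passed in explicitly
// so the scenario needs no fake timers and no Telegram credentials.
class SketchLivenessTracker {
  private readonly thresholdMs: number;
  private pollActivityAt = 0;
  private otherApiActivityAt = 0;

  constructor(thresholdMs: number) {
    this.thresholdMs = thresholdMs;
  }

  notePollActivity(nowMs: number): void {
    this.pollActivityAt = nowMs;
  }

  // Recorded for completeness; deliberately NOT consulted by detectStall().
  noteOtherApiActivity(nowMs: number): void {
    this.otherApiActivityAt = nowMs;
  }

  lastOtherApiActivityAt(): number {
    return this.otherApiActivityAt;
  }

  // Returns a message when getUpdates is stale, otherwise null.
  detectStall(nowMs: number): string | null {
    const pollElapsedMs = nowMs - this.pollActivityAt;
    return pollElapsedMs > this.thresholdMs ? "Polling stall detected" : null;
  }
}

// Stale getUpdates plus a recent sendMessage success must still report a stall.
function runMaskingScenario(): string | null {
  const tracker = new SketchLivenessTracker(120_000);
  tracker.notePollActivity(0);
  tracker.noteOtherApiActivity(100_000); // e.g. sendMessage succeeded
  return tracker.detectStall(130_000);
}
```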

User-visible / Behavior Changes

Telegram polling recovery now restarts stale inbound polling even if unrelated outbound Telegram API calls are active or recently succeeded.

Diagram (if applicable)

Before:
stale getUpdates + recent sendMessage -> watchdog suppressed -> inbound polling stays wedged

After:
stale getUpdates + recent sendMessage -> watchdog restart -> polling cycle recovers

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Ubuntu 24.04.4 LTS
  • Runtime/container: Node 22 / pnpm
  • Model/provider: N/A
  • Integration/channel (if any): Telegram plugin polling watchdog
  • Relevant config (redacted): targeted regression tests do not require a live token; live proof used TELEGRAM_BOT_TOKEN env fallback from a local redacted token file

Steps

  1. Create a stale getUpdates liveness state.
  2. Record unrelated Telegram API activity such as sendMessage success or an in-flight non-getUpdates API call.
  3. Fire the polling stall watchdog.

Expected

  • Watchdog reports a polling stall and restarts the polling cycle.

Actual

  • Before this fix, recent unrelated API activity suppressed the watchdog.
  • After this fix, stale getUpdates liveness controls the watchdog and restart proceeds.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Validation on the rebased branch:

pnpm exec oxfmt --check --threads=1 CHANGELOG.md extensions/telegram/src/polling-liveness.ts extensions/telegram/src/polling-liveness.test.ts extensions/telegram/src/polling-session.test.ts
All matched files use the correct format.

pnpm test extensions/telegram/src/polling-liveness.test.ts extensions/telegram/src/polling-session.test.ts -- --reporter=verbose
Test Files 2 passed (2)
Tests 23 passed (23)

Real behavior proof

  • Behavior or issue addressed: Telegram polling watchdog recovery should fire from stale getUpdates liveness even when unrelated outbound Bot API calls are active.
  • Real environment tested: Ubuntu 24.04.4 LTS, PR branch fix/telegram-polling-watchdog-getupdates, commit e301533582, Node v22.22.1, pnpm 10.33.2, real Telegram Bot API token from a local redacted token file, and a private DM chat with the bot.
  • Exact steps or command run after this patch: Called real Telegram Bot API getMe, read a recent private DM via getUpdates, sent a disabled-notification proof message with sendMessage, exercised the PR liveness code to verify stale getUpdates returns STALL after outbound Telegram activity, then ran an isolated source-mode Gateway on port 19986 across the watchdog window with TELEGRAM_BOT_TOKEN supplied via env fallback.
  • Evidence after fix: Copied live output from Ubuntu 24.04.4 LTS, token omitted:
telegram_getMe=ok botId=8656041674 username=set
telegram_recent_chat=found chatType=private updateId=198414331
telegram_sendMessage=ok source=updates chatId=6599824666 messageId=70
live_sendMessage_stale_getUpdates=STALL
live_liveness_message=Polling stall detected (active getUpdates stuck for 120s); forcing restart. [diag inFlight=1 outcome=started startedAt=0 finishedAt=n/a durationMs=n/a offset=123 apiElapsedMs=60001]

Telegram client also showed the real round trip:

[5/6/2026 3:30 PM] Crazy Cat: test
[5/6/2026 3:30 PM] Orinclaw Assistant: OpenClaw PR #78646 live watchdog proof 2026-05-06T22:30:53.827Z

Additional isolated Gateway live proof after the same patch:

branch=fix/telegram-polling-watchdog-getupdates
commit=e301533582
os=Ubuntu 24.04.4 LTS
mode=isolated source-mode Gateway, port 19986, real Telegram bot token from env fallback
telegram_provider_start=[default] starting provider (@orinclaw_ai_bot)
inbound_updates=real pending Telegram DM updates consumed by Gateway poller
window=2026-05-06T22:51:45+00:00..2026-05-06T22:55:47+00:00
samples=5
health=live on every sample
ready=true failing=[] on every sample
polling_stall_count=0
getupdates_conflict_count=0
telegram_provider_start_count=1
final_health={"ok":true,"status":"live"}
final_ready={"ready":true,"failing":[]}
shutdown=clean SIGINT after validation

Before-fix long-lived reproduction on parent commit d05415d603:

scenario=active getUpdates started at t=0, unrelated non-getUpdates API success every 30s, watchdog threshold=120000ms
sample_0 t=0s result=NO_STALL
sample_1 t=30s result=NO_STALL
sample_2 t=60s result=NO_STALL
sample_3 t=90s result=NO_STALL
sample_4 t=120s result=NO_STALL
sample_135s t=135s result=NO_STALL
final_expected=STALL
final_actual=NO_STALL
reproduced_bug=stale getUpdates exceeded threshold but watchdog stayed suppressed by unrelated API liveness
  • Observed result after fix: The bot successfully handled real getMe, getUpdates, and sendMessage; the watchdog returned STALL after real outbound Telegram activity; the isolated Gateway stayed live/ready across the watchdog window with one Telegram provider start, zero false Polling stall detected logs, and zero getUpdates conflict logs.
  • What was not tested: I did not run a multi-hour production soak or model-response verification in the isolated test home. The isolated home intentionally had no OpenAI auth, so agent replies failed after Telegram polling consumed inbound DM updates; that auth failure is separate from Telegram polling liveness.
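
The before-fix reproduction timeline above can be replayed with a small simulation. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the strict `>` comparison against the threshold is assumed, so the after-fix verdict at exactly t=120s depends on that choice.

```typescript
const THRESHOLD_MS = 120_000;
const SAMPLE_TIMES_S = [0, 30, 60, 90, 120, 135];

type Verdict = "STALL" | "NO_STALL";

// useApiLiveness=true models the pre-fix gate; false models the post-fix gate.
function simulate(useApiLiveness: boolean): Verdict[] {
  const pollStartedAtMs = 0; // getUpdates started at t=0 and never completes
  return SAMPLE_TIMES_S.map((t): Verdict => {
    const nowMs = t * 1000;
    // Unrelated API success every 30s: the latest success at or before now.
    const lastApiSuccessMs = Math.floor(nowMs / 30_000) * 30_000;
    const pollStale = nowMs - pollStartedAtMs > THRESHOLD_MS;
    const apiStale = nowMs - lastApiSuccessMs > THRESHOLD_MS;
    const stalled = useApiLiveness ? pollStale && apiStale : pollStale;
    return stalled ? "STALL" : "NO_STALL";
  });
}
```

In this model `simulate(true)` yields `NO_STALL` for every sample, including t=135s, which matches the reproduced bug, while `simulate(false)` yields `STALL` at t=135s.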

Human Verification (required)

  • Verified scenarios: stale getUpdates with recent non-polling API success, stale getUpdates with recent in-flight non-polling API activity, stale getUpdates with newer in-flight non-polling activity, existing stale polling restart paths, pre-fix long-lived suppression reproduction, real Telegram Bot API getMe/getUpdates/sendMessage, and an isolated live Gateway Telegram polling run across the watchdog window.
  • Edge cases checked: diagnostic output keeps apiElapsedMs for debugging while not using generic API liveness to suppress stale polling recovery.
  • What was not tested: I did not run a multi-hour production soak or model-response verification in the isolated test home; live Telegram polling startup, real update consumption, and watchdog-window stability were verified with a real token.
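
To illustrate the edge case about diagnostics, here is a sketch of a formatter that reproduces the `[diag ...]` suffix seen in the live output. The `PollDiagnostics` interface and `formatDiag` function are hypothetical; only the field names and the `n/a` placeholder come from the copied log line.

```typescript
// Hypothetical shape; field names mirror the keys in the copied log line.
interface PollDiagnostics {
  inFlight: number;
  outcome: string;
  startedAt: number;
  finishedAt: number | null;
  durationMs: number | null;
  offset: number;
  apiElapsedMs: number; // reported for debugging only, never used as a gate
}

function formatDiag(d: PollDiagnostics): string {
  // Missing values are rendered as "n/a", as in the live output.
  const orNa = (v: number | null): string => (v === null ? "n/a" : String(v));
  return (
    `[diag inFlight=${d.inFlight} outcome=${d.outcome} startedAt=${d.startedAt} ` +
    `finishedAt=${orNa(d.finishedAt)} durationMs=${orNa(d.durationMs)} ` +
    `offset=${d.offset} apiElapsedMs=${d.apiElapsedMs}]`
  );
}
```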

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Telegram polling may restart while outbound API traffic is healthy.
    • Mitigation: this is intentional; outbound API health is not inbound getUpdates health, and the watchdog threshold/throttling still bounds restarts.
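
This PR does not show how restart throttling is implemented, so the following is purely an illustrative sketch of how a throttle can bound restarts once the watchdog fires. `RestartThrottle`, `tryRestart`, and `minIntervalMs` are hypothetical names, not identifiers from the codebase.

```typescript
// Hypothetical throttle: allows at most one forced restart per minIntervalMs.
class RestartThrottle {
  private readonly minIntervalMs: number;
  private lastRestartAt: number | null = null;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Returns true (and records the restart) only if enough time has passed
  // since the previous restart; otherwise the restart request is dropped.
  tryRestart(nowMs: number): boolean {
    if (this.lastRestartAt !== null && nowMs - this.lastRestartAt < this.minIntervalMs) {
      return false;
    }
    this.lastRestartAt = nowMs;
    return true;
  }
}
```

Under this model, even if stale getUpdates is detected repeatedly while outbound traffic is healthy, restarts cannot occur more often than once per interval.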

openclaw-barnacle Bot added labels May 6, 2026: channel: telegram (Channel integration: telegram); size: S; triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.)

clawsweeper Bot (Contributor) commented May 6, 2026

Codex review: needs maintainer review before merge.

Summary
The PR changes Telegram polling liveness detection so stale getUpdates controls watchdog restarts even when unrelated Bot API calls are recent or in flight, and updates targeted tests plus the changelog.

Reproducibility: yes. A high-confidence source reproduction exists on current main: make getUpdates stale while a recent or in-flight non-getUpdates API call keeps apiElapsed under the threshold, and detectStall() returns null.

Real behavior proof
Sufficient (live_output): The PR body contains copied after-fix live output from a real Telegram Bot API setup plus an isolated Gateway polling-window run, with token details omitted.

Next step before merge
No repair lane is needed; the earlier changelog blocker is fixed and the remaining action is ordinary exact-head CI and maintainer merge validation.

Security
Cleared: The diff is limited to Telegram liveness logic, tests, and a changelog entry; it adds no dependency, workflow, permission, secret, or new execution surface.

Review details

Best possible solution:

Land this narrow watchdog fix after exact-head required checks finish green, while leaving broader transport-rebuild behavior tracked in the related open issue.

Do we have a high-confidence way to reproduce the issue?

Yes. A high-confidence source reproduction exists on current main: make getUpdates stale while a recent or in-flight non-getUpdates API call keeps apiElapsed under the threshold, and detectStall() returns null.

Is this the best way to solve the issue?

Yes. Removing generic Bot API liveness from the restart gate while retaining it as diagnostic output is the narrowest maintainable fix for the documented getUpdates watchdog behavior.

What I checked:

Likely related people:

  • vincentkoc: Path history shows recent merged work on the same Telegram polling watchdog/liveness/session area, including the wedged-runner watchdog fix. (role: recent Telegram polling watchdog maintainer; confidence: high; commits: ceace835563d; files: extensions/telegram/src/polling-liveness.ts, extensions/telegram/src/polling-session.ts, extensions/telegram/src/polling-liveness.test.ts)
  • steipete: Commit history for polling-liveness.ts shows the liveness tracker split and multiple adjacent Telegram polling/transport hardening commits. (role: liveness tracker introducer and adjacent maintainer; confidence: high; commits: 3eb48ec3e791, de1ac12f1c04, 1fb58ca5eee0; files: extensions/telegram/src/polling-liveness.ts, extensions/telegram/src/polling-session.ts, docs/channels/telegram.md)
  • obviyus: Recent Telegram polling startup and Telegram docs commits touch adjacent runtime behavior and documentation, but not the exact detectStall gate. (role: recent adjacent Telegram maintainer; confidence: medium; commits: 5b94c4ce9396, 814b125f114c; files: extensions/telegram/src/polling-session.ts, docs/channels/telegram.md)

Remaining risk / open question:

  • Exact-head CI was still queued at review time; merge should wait for required checks to finish green.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 120eb3426a14.

openclaw-barnacle Bot added the proof: supplied label (External PR includes structured after-fix real behavior proof.) and removed the triage: needs-real-behavior-proof label, May 6, 2026
clawsweeper Bot added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing.), May 6, 2026
openclaw-barnacle Bot removed the proof: sufficient label, May 6, 2026
ai-hpc force-pushed the fix/telegram-polling-watchdog-getupdates branch 2 times, most recently from cce6e7a to de0da45, May 6, 2026 22:41
clawsweeper Bot added the proof: sufficient label, May 6, 2026
openclaw-barnacle Bot added the triage: needs-real-behavior-proof label and removed the proof: supplied and proof: sufficient labels, May 6, 2026
clawsweeper Bot added the proof: sufficient label, May 6, 2026
openclaw-barnacle Bot added the proof: supplied label and removed the proof: sufficient and triage: needs-real-behavior-proof labels, May 6, 2026
ai-hpc force-pushed the fix/telegram-polling-watchdog-getupdates branch from de0da45 to e301533, May 6, 2026 23:36
clawsweeper Bot added the proof: sufficient label, May 6, 2026
steipete force-pushed the fix/telegram-polling-watchdog-getupdates branch from e301533 to 1caf97e, May 7, 2026 00:36
openclaw-barnacle Bot removed the proof: sufficient label, May 7, 2026
steipete merged commit 440111f into openclaw:main, May 7, 2026
93 checks passed
steipete pushed a commit that referenced this pull request, May 7, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request, May 9, 2026
rogerdigital pushed a commit to rogerdigital/openclaw that referenced this pull request, May 9, 2026

Labels

  • channel: telegram (Channel integration: telegram)
  • proof: supplied (External PR includes structured after-fix real behavior proof.)
  • size: S

2 participants