Skip to content

fix(telegram): lower polling keepalive delay#83304

Merged
obviyus merged 7 commits into
openclaw:mainfrom
amittell:fix/telegram-polling-nat-keepalive
May 28, 2026
Merged

fix(telegram): lower polling keepalive delay#83304
obviyus merged 7 commits into
openclaw:mainfrom
amittell:fix/telegram-polling-nat-keepalive

Conversation

@amittell

@amittell amittell commented May 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Lower Telegram polling's TCP keepalive initial delay from undici's default 60s to 30s.

Telegram getUpdates long-poll requests can sit idle for a long time. On some home or office networks, NAT entries expire around the same time as undici's default first keepalive probe, leaving no safety margin before the connection goes stale. This patch makes the Telegram dispatcher pass explicit keepalive options for direct, env-proxy, and explicit-proxy transports.

The PR was narrowed after review:

  • No shared gateway health-policy change.
  • No channel connect-grace change.
  • No duplicate polling status/session test churn.

Test plan

  • node scripts/run-vitest.mjs extensions/telegram/src/fetch.test.ts
  • git diff --check

Real behavior proof

  • Behavior addressed: Telegram long-poll connections can silently stall when a NAT/router drops an idle connection before the first TCP keepalive probe refreshes it.
  • Real environment tested: Two production gateway hosts previously ran the Telegram transport keepalive subset: mac-mini.lan with two Telegram polling accounts and rh-bot.lan with one Telegram polling account. This narrowed final head was also verified locally with focused dispatcher-policy tests.
  • Exact steps or command run after this patch: production soak used openclaw gateway status and openclaw channels status; final narrowed PR head used node scripts/run-vitest.mjs extensions/telegram/src/fetch.test.ts and git diff --check.
  • Evidence after fix: copied live output from the production soak:
CLI version: 2026.5.12 (/opt/homebrew/bin/openclaw)
Gateway version: 2026.5.12
Runtime: running (pid 50650, state active)
Connectivity probe: ok
- Telegram default: enabled, configured, running, connected, in:12h ago, out:16m ago, mode:polling, token:config, groups:unmentioned
- Telegram ratbot: enabled, configured, running, connected, in:28h ago, out:28h ago, mode:polling, token:config, groups:unmentioned

fetch.test.ts now asserts that direct and env-proxy Telegram dispatchers carry keepAlive: true and keepAliveInitialDelay: 30_000 in the connect options.

  • Observed result after fix: The production gateways' Telegram channels stayed running, connected during the soak with runtime uptime measured in days rather than repeated restart cycles. On the narrowed final head, 37 Telegram fetch tests passed locally and the PR diff is narrowed to extensions/telegram/src/fetch.ts and extensions/telegram/src/fetch.test.ts.
  • What was not tested: Deliberately induced NAT expiry, because that needs a controlled router/NAT scenario. Final-head live Telegram deploy was not rerun after narrowing the branch.

Copilot AI review requested due to automatic review settings May 17, 2026 23:07
@openclaw-barnacle openclaw-barnacle Bot added channel: telegram Channel integration: telegram gateway Gateway runtime size: S labels May 17, 2026
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 17, 2026
@clawsweeper

clawsweeper Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 28, 2026, 12:37 AM ET / 04:37 UTC.

Summary
This PR adds explicit Telegram dispatcher TCP keepalive options with a 30s initial delay and focused fetch tests for direct and env-proxy dispatcher policies.

PR surface: Source +4, Tests +8. Total +12 across 2 files.

Reproducibility: no. high-confidence current-main reproduction path was established. Source inspection and undici 8.3.0 source show current main relies on the 60s default keepalive delay, but the PR discussion says controlled NAT expiry and final-head live Telegram deploy were not tested.

Review metrics: 1 noteworthy metric.

  • Telegram transport default: 1 keepalive timing changed; 3 dispatcher modes affected. The PR changes the effective TCP keepalive initial delay for Telegram polling across direct, env-proxy, and explicit-proxy transports, which is runtime behavior CI cannot fully simulate.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until stronger real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Post redacted final-head live Telegram polling proof from the narrowed head, including the command or gateway/channel status output used after the patch.
  • [P1] Add an explicit-proxy keepalive assertion if maintainers want test parity for every dispatcher mode named in the PR body.

Proof guidance:

  • [P1] Needs stronger real behavior proof before merge: The PR includes earlier production logs, but not a final-head live run; the contributor should add redacted final-head live output, logs, terminal proof, or a recording, then update the PR body so ClawSweeper re-reviews automatically or ask a maintainer for @clawsweeper re-review.

Mantis proof suggestion
A final-head live Telegram polling smoke would materially improve review confidence, even though it will not prove controlled NAT expiry by itself. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

telegram live: run a final-head Telegram polling smoke that sends and receives a message and includes redacted gateway/channel status logs.

Risk before merge

  • [P1] The live proof is not tied to the narrowed final head, so maintainers still do not have real Telegram evidence that the final patch improves polling transport behavior in a deployed gateway.
  • [P1] Changing TCP keepalive timing for Telegram long-poll sockets can affect real direct, env-proxy, and explicit-proxy networks in ways unit tests cannot settle, including socket churn, stalled delivery, or gateway availability.
  • [P1] The PR source maps the explicit-proxy path through requestTls, but the new keepalive assertions directly cover only direct and env-proxy dispatchers.

Maintainer options:

  1. Add final-head live Telegram proof (recommended)
    Have the contributor or a maintainer run the narrowed head with a Telegram polling account and post redacted gateway/channel output or logs showing the live transport path after the patch.
  2. Accept the transport timing risk
    A maintainer can land based on source, undici contract, and focused tests while explicitly owning that final-head NAT/proxy behavior was not re-run live.
  3. Pause for controlled NAT proof
    If maintainers need proof of the claimed NAT failure mode rather than a live smoke, pause this PR until a controlled router/NAT repro or equivalent network harness is available.

Next step before merge

  • [P1] Needs maintainer proof/risk handling rather than a code repair; automation cannot supply the contributor’s final-head Telegram network evidence.

Security
Cleared: The diff only changes Telegram fetch transport options and focused tests; I found no concrete security or supply-chain regression.

Review details

Best possible solution:

Land the narrowed Telegram-only dispatcher change after final-head live Telegram polling proof is supplied or maintainers explicitly accept the transport risk.

Do we have a high-confidence way to reproduce the issue?

No high-confidence current-main reproduction path was established. Source inspection and undici 8.3.0 source show current main relies on the 60s default keepalive delay, but the PR discussion says controlled NAT expiry and final-head live Telegram deploy were not tested.

Is this the best way to solve the issue?

Yes for code shape: the narrowed Telegram dispatcher change is more maintainable than the earlier gateway grace and health-policy changes. The remaining blocker is proof and merge-risk handling, not a different implementation direction.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 647e18aa04a2.

Label changes

Label justifications:

  • P1: The PR targets Telegram polling stalls that can interrupt live channel delivery and gateway availability for affected users.
  • merge-risk: 🚨 message-delivery: Merging changes the live Telegram long-poll transport behavior, so incorrect timing or proxy behavior could still stall or disrupt message delivery.
  • merge-risk: 🚨 availability: The change affects long-lived gateway sockets and needs live proof because socket churn or stale connections can affect channel and gateway availability.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR includes earlier production logs, but not a final-head live run; the contributor should add redacted final-head live output, logs, terminal proof, or a recording, then update the PR body so ClawSweeper re-reviews automatically or ask a maintainer for @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +4, Tests +8. Total +12 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 21 17 +4
Tests 1 8 0 +8
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 29 17 +12

What I checked:

  • Repository policy applied: Root AGENTS.md was read fully, extensions/AGENTS.md was read for the bundled plugin boundary, and the Telegram maintainer note says Telegram transport PRs need real Telegram proof, preferably bot-to-bot QA or an equivalent live probe. (.agents/maintainer-notes/telegram.md:28, 647e18aa04a2)
  • Current main behavior: Current main builds Telegram connect options only for auto-select, IPv4 forcing, or DNS lookup and has no keepAliveInitialDelay setting in the Telegram dispatcher policy. (extensions/telegram/src/fetch.ts:162, 647e18aa04a2)
  • PR implementation: The PR branch makes buildTelegramConnectOptions always return keepAlive: true and keepAliveInitialDelay: 30_000, then attaches that object to direct, env-proxy, and explicit-proxy policy paths. (extensions/telegram/src/fetch.ts:161, d31d6b4033d0)
  • PR tests: The PR adds a helper asserting keepAlive and keepAliveInitialDelay and wires it into the direct and env-proxy dispatcher tests; explicit-proxy coverage remains indirect through the existing requestTls path. (extensions/telegram/src/fetch.test.ts:280, d31d6b4033d0)
  • Dependency contract: OpenClaw pins undici 8.3.0, whose connector source sets TCP keepalive by default and uses a 60_000 ms default when keepAliveInitialDelay is undefined; the PR changes that timing to 30_000 for Telegram dispatchers. (package.json:1875, 647e18aa04a2)
  • Release/current-main check: The latest release tag v2026.5.26 also lacks keepAliveInitialDelay in Telegram fetch.ts, and no tag contains the PR head SHA, so this change is not already shipped or implemented on current main. (extensions/telegram/src/fetch.ts:162, 10ad3aa16068)

Likely related people:

  • steipete: Git history shows Peter Steinberger authored the fetch runtime SDK seam and recent Telegram network logging work, and shortlog shows the heaviest recent contribution count across the fetch/runtime files sampled. (role: recent area contributor; confidence: high; commits: a126d23f0df4, b96f5d94dba6; files: extensions/telegram/src/fetch.ts, src/infra/net/undici-runtime.ts)
  • obviyus: The central Telegram transport fallback chain that resolveTelegramDispatcherPolicy builds on appears in the sampled history as authored by Ayaan Zaidi, who maps to the obviyus account in the PR context. (role: transport fallback feature history; confidence: medium; commits: e4825a0f9385; files: extensions/telegram/src/fetch.ts)
  • vincentkoc: Vincent Koc authored the trusted explicit-proxy Telegram work that is adjacent to the requestTls/proxyTls path this PR now relies on for explicit-proxy keepalive behavior. (role: adjacent explicit-proxy owner; confidence: medium; commits: 4251ad6638e7; files: extensions/telegram/src/fetch.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses Telegram long-poll stalls and spurious "disconnected" channel restarts. Adds OS-level TCP keepalive to undici's connect options for api.telegram.org so NAT timeouts and dead sockets surface promptly, stops the polling status publisher from clobbering connected: false at every cycle start, and widens the shared channel connect-grace window from 120s to 300s so slow-handshake channels aren't flagged disconnected before they finish connecting.

Changes:

  • Always set keepAlive: true / keepAliveInitialDelay: 30_000 on the Telegram undici Agent connect options.
  • Remove the connected: false write from notePollingStart; only notePollingStop/notePollingError mark the channel disconnected.
  • Bump DEFAULT_CHANNEL_CONNECT_GRACE_MS from 120s to 300s and update related Telegram tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
extensions/telegram/src/fetch.ts Enables undici TCP keepalive on the Telegram getUpdates transport; returns connect opts unconditionally.
extensions/telegram/src/polling-status.ts Stops notePollingStart from publishing connected: false; keeps telemetry-only reset.
extensions/telegram/src/polling-status.test.ts Adds regression test asserting notePollingStart patch omits connected.
extensions/telegram/src/polling-session.test.ts Updates expectations: one (not two) connected:false patch per stall+recover cycle; verifies cycle-start telemetry reset.
src/gateway/channel-health-policy.ts Widens default channel connect grace from 120s to 300s with motivating comment.

Comment thread extensions/telegram/src/polling-status.ts Outdated
Comment thread extensions/telegram/src/polling-status.test.ts Outdated
@amittell amittell force-pushed the fix/telegram-polling-nat-keepalive branch from 98ed39b to c97dca4 Compare May 18, 2026 02:04
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. P1 High-priority user-facing bug, regression, or broken workflow. impact:message-loss Channel message delivery can be lost, duplicated, or misrouted. labels May 18, 2026
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 18, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. impact:crash-loop Crash, hang, restart loop, or process-level availability failure. labels May 18, 2026
amittell added a commit to amittell/openclaw that referenced this pull request May 18, 2026
…nnected:true

The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.

Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.

Addresses clawsweeper review on openclaw#83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.
@amittell

Copy link
Copy Markdown
Contributor Author

Addressed both findings in 835553c (fix/telegram-polling-nat-keepalive).

P1 (preserve startup liveness after reconnects): added a post-grace branch to evaluateChannelHealth that flags polling channels as stale-socket when an inherited connected:true is paired with null lastConnectedAt AND null lastTransportActivityAt AND a non-null lastStartAt. Scoped to mode === "polling" so webhook channels and channels without continuous transport tracking are not falsely flagged.

P2 (align Telegram status grace with health grace): set TELEGRAM_POLLING_CONNECT_GRACE_MS = 300_000 in extensions/telegram/src/status-issues.ts with a comment cross-referencing DEFAULT_CHANNEL_CONNECT_GRACE_MS. openclaw channels status now suppresses polling-startup runtime issues for the same 300s window as the background health monitor.

Tests added in src/gateway/channel-health-policy.test.ts:

  • flags polling startups that hang before first transport activity (covers the new branch)
  • keeps polling channels healthy during the connect grace even with null transport (regression for the grace branch)
  • does not flag stale-startup when the polling channel has a prior successful connect (regression for legitimate quiet polling)

extensions/telegram/src/status.test.ts post-grace tests updated to lastStartAt: Date.now() - 301_000.

Acceptance criteria run locally and green:

node scripts/run-vitest.mjs \
  extensions/telegram/src/polling-status.test.ts \
  extensions/telegram/src/polling-session.test.ts \
  extensions/telegram/src/status.test.ts \
  extensions/telegram/src/fetch.test.ts \
  src/gateway/channel-health-policy.test.ts
# Tests 162 passed (162)

pnpm check:changed
# exit 0

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 18, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. impact:message-loss Channel message delivery can be lost, duplicated, or misrouted. impact:crash-loop Crash, hang, restart loop, or process-level availability failure. labels May 18, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 18, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 18, 2026
@openclaw-barnacle openclaw-barnacle Bot added triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: supplied External PR includes structured after-fix real behavior proof. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. gateway Gateway runtime size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 28, 2026
@obviyus obviyus self-assigned this May 28, 2026
Yibei Ou and others added 7 commits May 28, 2026 04:39
…ent NAT timeout stalls

Long-polling connections to api.telegram.org stay idle for up to the
getUpdates timeout (~900 s). Most home/office NAT tables expire idle TCP
entries after 60–1800 s (commonly ~1000 s). When the NAT entry is
silently dropped the connection hangs rather than returning an error,
leaving the grammY runner stuck until the 90 s stall watchdog fires and
forces a restart cycle.

Fix: unconditionally set `keepAlive: true` and
`keepAliveInitialDelay: 30_000` (30 s) on the undici Agent `connect`
options built in `buildTelegramConnectOptions`. OS-level TCP keepalive
probes sent every ~75 s (OS default) will:
1. Refresh the NAT table entry before it expires.
2. Surface dead connections immediately with ETIMEDOUT instead of
   hanging forever.

The `return Object.keys(connect).length > 0 ? connect : null` guard is
also removed; `connect` is now always non-empty so it always returns the
object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 92e454c)
…iden channel connect grace to 300s

(cherry picked from commit 1ca963a)
…nnected:true

The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.

Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.

Addresses clawsweeper review on openclaw#83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.
…ing startups

Defense in depth on top of 87db46c's notePollingStart connected:false fix.
The primary path (notePollingStart writes connected:false explicitly so
evaluateChannelHealth's existing connected===false branch catches a hung
restart) is unchanged. This adds a defensive post-grace branch that catches
the same hang via a different signature -- inherited connected:true paired
with null lastConnectedAt and null lastTransportActivityAt -- in case a
future code path forgets to clear the inherited connected flag on lifecycle
start. Scoped to mode:polling so webhook channels and channels without
continuous transport tracking are not falsely flagged.

Also bump lastStartAt: Date.now() - 121_000 to 301_000 in the spool-handler
timeout test added by upstream openclaw#83505 so it falls past the widened 300s
TELEGRAM_POLLING_CONNECT_GRACE_MS suppression window (mirroring the same
fixup already applied to the two adjacent polling-startup tests).
Drop the 120s -> 300s widening from this PR after maintainer feedback that
the extra grace masks real startup bugs. The defense-in-depth checks added
in earlier commits (notePollingStart clearing inherited connected state,
the stale-socket policy branch, the per-snapshot startup grace test) all
work fine at 120s and remain valuable on their own.

Reverts in:
- src/gateway/channel-health-policy.ts: DEFAULT_CHANNEL_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status-issues.ts: TELEGRAM_POLLING_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status.test.ts: lastStartAt 301_000 -> 121_000 (3 cases)

The new channel-health-policy.test.ts cases use explicit channelConnectGraceMs:
10_000 in the policy, so they are unaffected by the default constant change.
@obviyus obviyus force-pushed the fix/telegram-polling-nat-keepalive branch from d31d6b4 to 6462fff Compare May 28, 2026 04:42
@obviyus obviyus merged commit f7c32fc into openclaw:main May 28, 2026
90 of 91 checks passed
@obviyus

obviyus commented May 28, 2026

Copy link
Copy Markdown
Contributor

Landed via squash onto main.

  • Scoped tests: node scripts/run-vitest.mjs extensions/telegram/src/fetch.test.ts; git diff --check origin/main...HEAD; node ~/.codex/skills/custom/telegram-e2e-bot-to-bot/scripts/run-mock-sut-e2e.mjs --gateway-port 19979 --mock-port 19982 --text '@{sut} Please answer with OPENCLAW_E2E_OK only.' --expect OPENCLAW_E2E_OK
  • Real behavior proof: Telegram bot-to-bot smoke passed on rebased head; SUT replied OPENCLAW_E2E_OK in-thread, mockRequests=1
  • Changelog: not updated; CHANGELOG.md is release-owned for normal PRs, release-note context is in the PR body
  • Land commit: 6462fff
  • Merge commit: f7c32fc

Thanks @amittell!

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 28, 2026
* fix(telegram): enable TCP keepalive on getUpdates connections to prevent NAT timeout stalls

Long-polling connections to api.telegram.org stay idle for up to the
getUpdates timeout (~900 s). Most home/office NAT tables expire idle TCP
entries after 60–1800 s (commonly ~1000 s). When the NAT entry is
silently dropped the connection hangs rather than returning an error,
leaving the grammY runner stuck until the 90 s stall watchdog fires and
forces a restart cycle.

Fix: unconditionally set `keepAlive: true` and
`keepAliveInitialDelay: 30_000` (30 s) on the undici Agent `connect`
options built in `buildTelegramConnectOptions`. OS-level TCP keepalive
probes sent every ~75 s (OS default) will:
1. Refresh the NAT table entry before it expires.
2. Surface dead connections immediately with ETIMEDOUT instead of
   hanging forever.

The `return Object.keys(connect).length > 0 ? connect : null` guard is
also removed; `connect` is now always non-empty so it always returns the
object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 92e454c)

* fix(telegram): stop self-flagging disconnected on poll-cycle start; widen channel connect grace to 300s

(cherry picked from commit 1ca963a)

* fix(telegram): catch hung polling startups that preserve inherited connected:true

The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.

Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.

Addresses clawsweeper review on openclaw#83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.

* fix(telegram): clear polling connected state on startup

* fix(gateway): add defense-in-depth health-policy branch for hung polling startups

Defense in depth on top of 87db46c's notePollingStart connected:false fix.
The primary path (notePollingStart writes connected:false explicitly so
evaluateChannelHealth's existing connected===false branch catches a hung
restart) is unchanged. This adds a defensive post-grace branch that catches
the same hang via a different signature -- inherited connected:true paired
with null lastConnectedAt and null lastTransportActivityAt -- in case a
future code path forgets to clear the inherited connected flag on lifecycle
start. Scoped to mode:polling so webhook channels and channels without
continuous transport tracking are not falsely flagged.

Also bump lastStartAt: Date.now() - 121_000 to 301_000 in the spool-handler
timeout test added by upstream openclaw#83505 so it falls past the widened 300s
TELEGRAM_POLLING_CONNECT_GRACE_MS suppression window (mirroring the same
fixup already applied to the two adjacent polling-startup tests).

* revert(telegram,gateway): keep connect grace at 120s

Drop the 120s -> 300s widening from this PR after maintainer feedback that
the extra grace masks real startup bugs. The defense-in-depth checks added
in earlier commits (notePollingStart clearing inherited connected state,
the stale-socket policy branch, the per-snapshot startup grace test) all
work fine at 120s and remain valuable on their own.

Reverts in:
- src/gateway/channel-health-policy.ts: DEFAULT_CHANNEL_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status-issues.ts: TELEGRAM_POLLING_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status.test.ts: lastStartAt 301_000 -> 121_000 (3 cases)

The new channel-health-policy.test.ts cases use explicit channelConnectGraceMs:
10_000 in the policy, so they are unaffected by the default constant change.

* fix(telegram): narrow polling keepalive fix

---------

Co-authored-by: Yibei Ou <yibeiou@Yibeis-Mac-mini.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
steipete pushed a commit that referenced this pull request May 28, 2026
* fix(telegram): enable TCP keepalive on getUpdates connections to prevent NAT timeout stalls

Long-polling connections to api.telegram.org stay idle for up to the
getUpdates timeout (~900 s). Most home/office NAT tables expire idle TCP
entries after 60–1800 s (commonly ~1000 s). When the NAT entry is
silently dropped the connection hangs rather than returning an error,
leaving the grammY runner stuck until the 90 s stall watchdog fires and
forces a restart cycle.

Fix: unconditionally set `keepAlive: true` and
`keepAliveInitialDelay: 30_000` (30 s) on the undici Agent `connect`
options built in `buildTelegramConnectOptions`. OS-level TCP keepalive
probes sent every ~75 s (OS default) will:
1. Refresh the NAT table entry before it expires.
2. Surface dead connections immediately with ETIMEDOUT instead of
   hanging forever.

The `return Object.keys(connect).length > 0 ? connect : null` guard is
also removed; `connect` is now always non-empty so it always returns the
object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 92e454c)

* fix(telegram): stop self-flagging disconnected on poll-cycle start; widen channel connect grace to 300s

(cherry picked from commit 1ca963a)

* fix(telegram): catch hung polling startups that preserve inherited connected:true

The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.

Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.

Addresses clawsweeper review on #83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.

* fix(telegram): clear polling connected state on startup

* fix(gateway): add defense-in-depth health-policy branch for hung polling startups

Defense in depth on top of 87db46c's notePollingStart connected:false fix.
The primary path (notePollingStart writes connected:false explicitly so
evaluateChannelHealth's existing connected===false branch catches a hung
restart) is unchanged. This adds a defensive post-grace branch that catches
the same hang via a different signature -- inherited connected:true paired
with null lastConnectedAt and null lastTransportActivityAt -- in case a
future code path forgets to clear the inherited connected flag on lifecycle
start. Scoped to mode:polling so webhook channels and channels without
continuous transport tracking are not falsely flagged.

Also bump lastStartAt: Date.now() - 121_000 to 301_000 in the spool-handler
timeout test added by upstream #83505 so it falls past the widened 300s
TELEGRAM_POLLING_CONNECT_GRACE_MS suppression window (mirroring the same
fixup already applied to the two adjacent polling-startup tests).

* revert(telegram,gateway): keep connect grace at 120s

Drop the 120s -> 300s widening from this PR after maintainer feedback that
the extra grace masks real startup bugs. The defense-in-depth checks added
in earlier commits (notePollingStart clearing inherited connected state,
the stale-socket policy branch, the per-snapshot startup grace test) all
work fine at 120s and remain valuable on their own.

Reverts in:
- src/gateway/channel-health-policy.ts: DEFAULT_CHANNEL_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status-issues.ts: TELEGRAM_POLLING_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status.test.ts: lastStartAt 301_000 -> 121_000 (3 cases)

The new channel-health-policy.test.ts cases use explicit channelConnectGraceMs:
10_000 in the policy, so they are unaffected by the default constant change.

* fix(telegram): narrow polling keepalive fix

---------

Co-authored-by: Yibei Ou <yibeiou@Yibeis-Mac-mini.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
eleboucher pushed a commit to eleboucher/homelab that referenced this pull request May 31, 2026
…026.5.28) (#759)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [ghcr.io/openclaw/openclaw](https://openclaw.ai) ([source](https://github.com/openclaw/openclaw)) | patch | `2026.5.27` → `2026.5.28` |

---

### Release Notes

<details>
<summary>openclaw/openclaw (ghcr.io/openclaw/openclaw)</summary>

### [`v2026.5.28`](https://github.com/openclaw/openclaw/blob/HEAD/CHANGELOG.md#2026528)

[Compare Source](openclaw/openclaw@v2026.5.27...v2026.5.28)

##### Highlights

- Agent and Codex runtime recovery is steadier: subagents keep cwd/workspace separation, hook context stays prompt-local, session locks release on timeout abort while live OpenClaw locks survive cleanup, stale restart continuations are avoided, and Codex app-server/helper failures no longer tear down shared runtime state. ([#&#8203;87218](openclaw/openclaw#87218), [#&#8203;86875](openclaw/openclaw#86875), [#&#8203;87409](openclaw/openclaw#87409), [#&#8203;87399](openclaw/openclaw#87399), [#&#8203;87375](openclaw/openclaw#87375), [#&#8203;88129](openclaw/openclaw#88129))
- Channel delivery and session identity got safer across outbound plugin hooks, Matrix room ids, iMessage reactions/approvals, Slack final replies, Discord recovered tool warnings, runtime-config message actions, WhatsApp profile auth roots, Telegram polling, and Microsoft Teams service URL trust checks. ([#&#8203;73706](openclaw/openclaw#73706), [#&#8203;75670](openclaw/openclaw#75670), [#&#8203;87366](openclaw/openclaw#87366), [#&#8203;87451](openclaw/openclaw#87451), [#&#8203;87334](openclaw/openclaw#87334), [#&#8203;84535](openclaw/openclaw#84535), [#&#8203;82492](openclaw/openclaw#82492), [#&#8203;83304](openclaw/openclaw#83304), [#&#8203;87160](openclaw/openclaw#87160))
- Mobile and chat surfaces got a broader refresh: the iOS Pro UI, hosted push relay default, realtime Talk tab playback, Gateway chat transport, onboarding, Talk permissions, WebChat reconnect delivery, and session picker behavior now preserve more state across reconnects and empty searches. ([#&#8203;87367](openclaw/openclaw#87367), [#&#8203;87531](openclaw/openclaw#87531), [#&#8203;87682](openclaw/openclaw#87682), [#&#8203;88096](openclaw/openclaw#88096), [#&#8203;88105](openclaw/openclaw#88105)) Thanks [@&#8203;ngutman](https://github.com/ngutman) and [@&#8203;BunsDev](https://github.com/BunsDev).
- Browser, channel, and automation inputs are stricter: Browser tool timeouts, viewport/tab indices, Gateway ports, cron retry handling, Discord component ids, schema array refs, Telegram callback pages, and channel progress callbacks now reject malformed values earlier and preserve the intended delivery context. ([#&#8203;82887](openclaw/openclaw#82887))
- Provider, media, and document coverage expands with Claude Opus 4.8, Fal Krea image schemas, NVIDIA featured models, MiniMax streaming music responses, encrypted PDF extraction, voice model catalogs, GitHub Copilot agent runtime support, and a Codex Supervisor plugin path for delegated Codex workflows. ([#&#8203;87845](openclaw/openclaw#87845), [#&#8203;87890](openclaw/openclaw#87890), [#&#8203;80775](openclaw/openclaw#80775), [#&#8203;84764](openclaw/openclaw#84764), [#&#8203;87751](openclaw/openclaw#87751), [#&#8203;87794](openclaw/openclaw#87794))
- CLI, auth, doctor, and provider paths fail faster and recover more clearly: malformed numeric/version options are rejected, workspace dotenv provider credentials are ignored, heartbeat defaults, OAuth/token lifetimes, and local service startup requests are bounded, agent auth health labels are clearer, legacy `api_key` auth profiles migrate to canonical form, and restart guidance is actionable. ([#&#8203;87398](openclaw/openclaw#87398), [#&#8203;86281](openclaw/openclaw#86281), [#&#8203;87361](openclaw/openclaw#87361), [#&#8203;88133](openclaw/openclaw#88133), [#&#8203;83655](openclaw/openclaw#83655), [#&#8203;87559](openclaw/openclaw#87559), [#&#8203;88088](openclaw/openclaw#88088), [#&#8203;85924](openclaw/openclaw#85924)) Thanks [@&#8203;vincentkoc](https://github.com/vincentkoc) and [@&#8203;giodl73-repo](https://github.com/giodl73-repo).
- Plugin and Gateway hot paths do less repeated work while preserving cache correctness for install records, config JSON parsing, tool search catalogs, session stores, manifest model rows, auto-enabled plugin config, browser tokens, viewer assets, and release-split external plugin packages. ([#&#8203;86699](openclaw/openclaw#86699))
- Release, QA, and E2E validation now bound more log, artifact, harness, and cross-OS waits so failing lanes produce proof instead of hanging or false-greening.

##### Changes

- Status: show active subagent details in status output.
- Diffs: split the default language pack and expand default Diffs language coverage while keeping the host floor aligned. ([#&#8203;87370](openclaw/openclaw#87370), [#&#8203;87372](openclaw/openclaw#87372)) Thanks [@&#8203;RomneyDa](https://github.com/RomneyDa).
- ClawHub: add plugin display names plus skill verification and trust surfaces. ([#&#8203;87354](openclaw/openclaw#87354), [#&#8203;86699](openclaw/openclaw#86699)) Thanks [@&#8203;thewilloftheshadow](https://github.com/thewilloftheshadow) and [@&#8203;Patrick-Erichsen](https://github.com/Patrick-Erichsen).
- iOS: refresh the dev app with Pro Command, Chat, Agents, Settings, hosted push relay defaults, and realtime Talk playback wired to gateway sessions, diagnostics, chat, and realtime Talk. ([#&#8203;87367](openclaw/openclaw#87367), [#&#8203;88096](openclaw/openclaw#88096), [#&#8203;88105](openclaw/openclaw#88105)) Thanks [@&#8203;Solvely-Colin](https://github.com/Solvely-Colin) and [@&#8203;ngutman](https://github.com/ngutman).
- Docs: clarify Codex computer-use setup, paste-token stdin auth setup, macOS gateway sleep troubleshooting, native Codex hook relay recovery, container model auth, install deployment cards, device-token admin gating, CLI setup flow compatibility, Notte cloud browser CDP setup, and backport targets. ([#&#8203;87313](openclaw/openclaw#87313), [#&#8203;63050](openclaw/openclaw#63050), [#&#8203;87685](openclaw/openclaw#87685)) Thanks [@&#8203;bdjben](https://github.com/bdjben), [@&#8203;liaoandi](https://github.com/liaoandi), and [@&#8203;thewilloftheshadow](https://github.com/thewilloftheshadow).
- PDF/tools: use ClawPDF for PDF extraction, support encrypted PDF extraction, and surface MCP structured content in agent tool results. ([#&#8203;87670](openclaw/openclaw#87670), [#&#8203;87751](openclaw/openclaw#87751))
- Providers: add Claude Opus 4.8 support, Fal Krea image model schemas, NVIDIA featured model catalogs, MiniMax streaming music responses, and provider-backed voice model catalogs. ([#&#8203;87845](openclaw/openclaw#87845), [#&#8203;87890](openclaw/openclaw#87890), [#&#8203;80775](openclaw/openclaw#80775), [#&#8203;84764](openclaw/openclaw#84764), [#&#8203;87794](openclaw/openclaw#87794)) Thanks [@&#8203;eleqtrizit](https://github.com/eleqtrizit) and [@&#8203;vincentkoc](https://github.com/vincentkoc).
- Codex/GitHub: add the GitHub Copilot agent runtime and the Codex Supervisor plugin package.
- Plugins: externalize GitHub Copilot and Tokenjuice as official install-on-demand plugins with npm and ClawHub publish metadata.
- Workboard: add agent coordination tools for tracking and handing off active agent work.
- Discord: show commentary in progress drafts so live Discord runs expose useful in-progress context. ([#&#8203;85200](openclaw/openclaw#85200))
- Plugin SDK: add a reply payload sending hook for plugins that need to deliver channel-owned replies and flatten package types for SDK declarations. ([#&#8203;82823](openclaw/openclaw#82823), [#&#8203;87165](openclaw/openclaw#87165)) Thanks [@&#8203;piersonr](https://github.com/piersonr) and [@&#8203;RomneyDa](https://github.com/RomneyDa).
- Policy: add policy comparison, ingress-channel conformance, and sandbox-posture conformance checks. ([#&#8203;85572](openclaw/openclaw#85572), [#&#8203;85744](openclaw/openclaw#85744), [#&#8203;86768](openclaw/openclaw#86768))

##### Fixes

- Agents: fall back to local config pruning when the optional `agents delete` Gateway probe cannot authenticate, so offline installs can still delete agents without removing shared workspaces.
- Tighten phone-control mutation authorization \[AI]. ([#&#8203;87150](openclaw/openclaw#87150)) Thanks [@&#8203;pgondhi987](https://github.com/pgondhi987).
- Clarify directive persistence authorization policy \[AI]. ([#&#8203;86369](openclaw/openclaw#86369)) Thanks [@&#8203;pgondhi987](https://github.com/pgondhi987).
- Agents/Codex: keep spawned agent cwd/workspace state separated, forward ACP spawn attachments, keep hook context prompt-local, release session locks on timeout abort and runtime teardown without deleting live OpenClaw-owned locks during cleanup, avoid session event queue self-wait, clean up exec abort listeners, stream assistant deltas incrementally, recover raw missing-thread compaction failures, preserve rotated compaction session identity, keep compaction-timeout snapshots continuable, preserve shared app-server state across startup or helper failures, keep native hook relay alive across restarts and prune stale bridge files, close native hook relay replacement races, keep Claude live tool progress visible for watchdog recovery, suppress abandoned requester completion handoff, route workspace memory through tools, resolve Codex runtime models first, report quarantined dynamic tools, format `skills` command output, bind node auto-review to prepared plans, retry Claude CLI transcript probes, and bound compaction/steering retries. ([#&#8203;87218](openclaw/openclaw#87218), [#&#8203;86875](openclaw/openclaw#86875), [#&#8203;86123](openclaw/openclaw#86123), [#&#8203;88129](openclaw/openclaw#88129), [#&#8203;87399](openclaw/openclaw#87399), [#&#8203;87375](openclaw/openclaw#87375), [#&#8203;72574](openclaw/openclaw#72574), [#&#8203;87383](openclaw/openclaw#87383), [#&#8203;87400](openclaw/openclaw#87400), [#&#8203;83022](openclaw/openclaw#83022), [#&#8203;87671](openclaw/openclaw#87671), [#&#8203;87738](openclaw/openclaw#87738), [#&#8203;87747](openclaw/openclaw#87747), [#&#8203;87706](openclaw/openclaw#87706), [#&#8203;87546](openclaw/openclaw#87546), [#&#8203;87541](openclaw/openclaw#87541), [#&#8203;81048](openclaw/openclaw#81048)) Thanks [@&#8203;mbelinky](https://github.com/mbelinky), [@&#8203;Alix-007](https://github.com/Alix-007), [@&#8203;luoyanglang](https://github.com/luoyanglang), [@&#8203;yetval](https://github.com/yetval), [@&#8203;sjf](https://github.com/sjf), [@&#8203;joshavant](https://github.com/joshavant), [@&#8203;benjamin1492](https://github.com/benjamin1492), [@&#8203;c19354837](https://github.com/c19354837), [@&#8203;fuller-stack-dev](https://github.com/fuller-stack-dev), [@&#8203;pfrederiksen](https://github.com/pfrederiksen), and [@&#8203;dodge1218](https://github.com/dodge1218).
- Codex Supervisor: keep real-home app-server MCP session listing on the loaded state path, bound stored history scans, and close WebSocket probes cleanly.
- Channels: thread canonical session keys into outbound hooks, preserve Matrix room-id case, keep fallback tool warnings mention-inert, retain delivered Slack final replies during late cleanup, continue iMessage polling after denied reactions, suppress duplicate native exec approvals, resolve Gateway message actions against the active runtime config, preserve Telegram SecretRef prompt config and polling keepalives, preserve WhatsApp profile auth roots, QR display, document filenames, and plugin hook config, suppress Discord recovered tool warnings, preserve the Discord voice outbound helper, cap Discord/Signal/Zalo channel request and container timeouts, and block untrusted Teams service URLs while keeping TeamsSDK patterns aligned. ([#&#8203;73706](openclaw/openclaw#73706), [#&#8203;75670](openclaw/openclaw#75670), [#&#8203;87366](openclaw/openclaw#87366), [#&#8203;87451](openclaw/openclaw#87451), [#&#8203;87465](openclaw/openclaw#87465), [#&#8203;87334](openclaw/openclaw#87334), [#&#8203;84535](openclaw/openclaw#84535), [#&#8203;76262](openclaw/openclaw#76262), [#&#8203;83304](openclaw/openclaw#83304), [#&#8203;82492](openclaw/openclaw#82492), [#&#8203;87581](openclaw/openclaw#87581), [#&#8203;77114](openclaw/openclaw#77114), [#&#8203;86426](openclaw/openclaw#86426), [#&#8203;85529](openclaw/openclaw#85529), [#&#8203;87160](openclaw/openclaw#87160)) Thanks [@&#8203;zeroaltitude](https://github.com/zeroaltitude), [@&#8203;lukeboyett](https://github.com/lukeboyett), [@&#8203;jarvis-mns1](https://github.com/jarvis-mns1), [@&#8203;xiaotian](https://github.com/xiaotian), [@&#8203;funmerlin](https://github.com/funmerlin), [@&#8203;joshavant](https://github.com/joshavant), [@&#8203;eleqtrizit](https://github.com/eleqtrizit), [@&#8203;heyitsaamir](https://github.com/heyitsaamir), [@&#8203;amittell](https://github.com/amittell), [@&#8203;lidge-jun](https://github.com/lidge-jun), [@&#8203;liorb-mountapps](https://github.com/liorb-mountapps), [@&#8203;masatohoshino](https://github.com/masatohoshino), [@&#8203;bladin](https://github.com/bladin), and [@&#8203;giodl73-repo](https://github.com/giodl73-repo).
- CLI/auth/doctor/providers: reject malformed numeric/timeout/subcommand-version inputs, ignore workspace dotenv provider credentials, wait for respawn child shutdown, bound heartbeat defaults plus Codex, GitHub Copilot, OpenAI, Anthropic, Google, Feishu, LM Studio, MiniMax, Xiaomi TTS, and local-provider OAuth/token/model requests, harden Codex auth probes, label auth health by agent, preserve explicit agentRuntime pins during Codex model migration, warm provider auth off the main thread, honor Codex response timeouts, stop migrating current Claude Haiku 4.5 profiles to Sonnet, bound local service startup, resolve GPT-5.5 without cached catalog, migrate legacy memory auto-provider config, rewrite non-canonical `api_key` auth profiles, and make doctor restart follow-ups actionable. ([#&#8203;87398](openclaw/openclaw#87398), [#&#8203;86281](openclaw/openclaw#86281), [#&#8203;87361](openclaw/openclaw#87361), [#&#8203;88133](openclaw/openclaw#88133), [#&#8203;83655](openclaw/openclaw#83655), [#&#8203;87559](openclaw/openclaw#87559), [#&#8203;87719](openclaw/openclaw#87719), [#&#8203;88088](openclaw/openclaw#88088), [#&#8203;85924](openclaw/openclaw#85924), [#&#8203;84362](openclaw/openclaw#84362)) Thanks [@&#8203;Patrick-Erichsen](https://github.com/Patrick-Erichsen), [@&#8203;samzong](https://github.com/samzong), [@&#8203;giodl73-repo](https://github.com/giodl73-repo), [@&#8203;alkor2000](https://github.com/alkor2000), [@&#8203;mmaps](https://github.com/mmaps), [@&#8203;nxmxbbd](https://github.com/nxmxbbd), and [@&#8203;vincentkoc](https://github.com/vincentkoc).
- Gateway/security/session state: expire browser tokens after auth rotation, scope assistant idempotency dedupe, drain probe client closes, avoid stale restart continuation reuse, preserve retry-after fallbacks and stale rate-limit cooldown probes, bound webchat image and artifact transcript scans, include seconds in inbound metadata timestamps, clear completed session active runs, clear stale chat stream buffers, and evict current plugin-state namespaces at row caps. ([#&#8203;87810](openclaw/openclaw#87810), [#&#8203;87833](openclaw/openclaw#87833), [#&#8203;75089](openclaw/openclaw#75089)) Thanks [@&#8203;joshavant](https://github.com/joshavant) and [@&#8203;litang9](https://github.com/litang9).
- Config/parsing/network: reject partial numeric parsing, parse provider/Discord retry headers and dates strictly, honor IPv6 and bare IPv6 `no_proxy` entries, preserve empty plugin allowlists, canonicalize secret target array indexes, and reject malformed media content lengths, inspected TCP ports, marketplace content lengths, cron epochs, sandbox stat fields, unsafe duration values, empty config path segments, noncanonical schema array refs, unsafe Telegram callback pages, and invalid Teams attachment-fetch DNS targets. ([#&#8203;87883](openclaw/openclaw#87883)) Thanks [@&#8203;zhangguiping-xydt](https://github.com/zhangguiping-xydt).
- Browser/input hardening: reject invalid tab indexes, excessive viewport resizes, explicit zero CDP ports, malformed geolocation options, unsafe screenshot or permission-grant timeouts, loose response-body limits, invalid cookie expiries, and non-finite Browser tool delays/timeouts.
- Cron/automation: retry recurring jobs after transient model rate limits before waiting for the next scheduled slot, and preflight model fallbacks before skipping scheduled work. ([#&#8203;82887](openclaw/openclaw#82887)) Thanks [@&#8203;chen-zhang-cs-code](https://github.com/chen-zhang-cs-code).
- Auto-reply/directives: respect provider and relayed channel metadata during directive persistence so channel-originated decisions keep their intended context. ([#&#8203;87683](openclaw/openclaw#87683))
- WhatsApp: resolve the auth directory from the active profile so profile-scoped WhatsApp installs do not drift to the wrong credential root. ([#&#8203;82492](openclaw/openclaw#82492)) Thanks [@&#8203;lidge-jun](https://github.com/lidge-jun).
- Gateway/session state: clear completed session active runs, avoid cold-loading providers for MCP inventory, cache single-session child indexes, cap handshake timers, and bound preauth, auth-guard, media, transcript, readiness, and port options.
- Channels/replies: preserve channel-owned progress callbacks when verbose output is off, keep group-room progress suppression intact, prefer external session delivery context, escape Discord component id delimiters, force final TUI chat repaints, show Slack reasoning previews, and normalize Discord/Matrix/Mattermost channel numeric options. ([#&#8203;87476](openclaw/openclaw#87476), [#&#8203;87423](openclaw/openclaw#87423))
- Agents/tool args: harden smart-quoted argument repair for edit arrays and exact escaped arguments so model-produced tool calls recover without corrupting valid input. ([#&#8203;86611](openclaw/openclaw#86611)) Thanks [@&#8203;ferminquant](https://github.com/ferminquant).
- Providers/agents: preserve seeded Anthropic signatures, preserve signed thinking payloads, concatenate signature-delta chunks, preserve DeepSeek `reasoning_content` replay across tier suffixes, apply OpenRouter strict9 ids to Mistral routes, promote Ollama plain-text tool calls, load NVIDIA featured model catalogs, stream MiniMax music generation responses, and recover empty preflight compaction. ([#&#8203;87593](openclaw/openclaw#87593), [#&#8203;87493](openclaw/openclaw#87493), [#&#8203;80775](openclaw/openclaw#80775), [#&#8203;84764](openclaw/openclaw#84764)) Thanks [@&#8203;Pluviobyte](https://github.com/Pluviobyte) and [@&#8203;eleqtrizit](https://github.com/eleqtrizit).
- Media/images: skip CLI image cache refs when resolving generated images, allow trusted generated HTML attachments, and bound generated video downloads so stale refs and slow providers fail cleanly. ([#&#8203;87523](openclaw/openclaw#87523), [#&#8203;87982](openclaw/openclaw#87982))
- File transfer: handle late tar stdin pipe errors after archive validation or unpacking has already settled.
- Performance: trust install-record caches between reloads, prefer native JSON parsing, reuse unchanged tool-search catalogs, reuse gateway session and plugin metadata paths, skip unchanged store serialization, patch single-entry session writes, add precomputed session patch writers, reduce store clone allocations, cache manifest model catalog rows and auto-enabled plugin config, avoid full session snapshots for entry reads, defer configured Slack full startup, prefer bundled plugin dist entries, and slim current metadata identity caches. ([#&#8203;87760](openclaw/openclaw#87760))
- Docker/release/QA: package runtime workspace templates, stream cross-OS served artifacts, preserve sparse Crabbox run artifacts, isolate npm plugin installs per package, reject incompatible package plugin API installs, drop the leftover root Sharp dependency from package manifests after the Rastermill migration, bound OpenClaw instance logs, plugin gauntlet relay logs, MCP channel buffers, kitchen-sink scans, agent-turn assertions, QA-Lab credential broker calls, QA Matrix substrate requests, and release scenario logs, and keep release/google live guards current. ([#&#8203;87647](openclaw/openclaw#87647), [#&#8203;87477](openclaw/openclaw#87477)) Thanks [@&#8203;rohitjavvadi](https://github.com/rohitjavvadi) and [@&#8203;vincentkoc](https://github.com/vincentkoc).
- Release/CI: bound manual git fetches, ClawHub verifier responses, ClawHub owner metadata, dependency-guard error bodies, Parallels limits, startup/test/memory budget parsing, and diffs viewer build warnings so release lanes fail with useful proof instead of hanging. ([#&#8203;87839](openclaw/openclaw#87839))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xMDEuMSIsInVwZGF0ZWRJblZlciI6IjQzLjEwMS4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZS9jb250YWluZXIiLCJ0eXBlL3BhdGNoIl19-->

Reviewed-on: https://git.erwanleboucher.dev/eleboucher/homelab/pulls/759
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
* fix(telegram): enable TCP keepalive on getUpdates connections to prevent NAT timeout stalls

Long-polling connections to api.telegram.org stay idle for up to the
getUpdates timeout (~900 s). Most home/office NAT tables expire idle TCP
entries after 60–1800 s (commonly ~1000 s). When the NAT entry is
silently dropped the connection hangs rather than returning an error,
leaving the grammY runner stuck until the 90 s stall watchdog fires and
forces a restart cycle.

Fix: unconditionally set `keepAlive: true` and
`keepAliveInitialDelay: 30_000` (30 s) on the undici Agent `connect`
options built in `buildTelegramConnectOptions`. OS-level TCP keepalive
probes sent every ~75 s (OS default) will:
1. Refresh the NAT table entry before it expires.
2. Surface dead connections immediately with ETIMEDOUT instead of
   hanging forever.

The `return Object.keys(connect).length > 0 ? connect : null` guard is
also removed; `connect` is now always non-empty so it always returns the
object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 92e454c)

* fix(telegram): stop self-flagging disconnected on poll-cycle start; widen channel connect grace to 300s

(cherry picked from commit 1ca963a)

* fix(telegram): catch hung polling startups that preserve inherited connected:true

The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.

Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.

Addresses clawsweeper review on openclaw#83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.

* fix(telegram): clear polling connected state on startup

* fix(gateway): add defense-in-depth health-policy branch for hung polling startups

Defense in depth on top of 87db46c's notePollingStart connected:false fix.
The primary path (notePollingStart writes connected:false explicitly so
evaluateChannelHealth's existing connected===false branch catches a hung
restart) is unchanged. This adds a defensive post-grace branch that catches
the same hang via a different signature -- inherited connected:true paired
with null lastConnectedAt and null lastTransportActivityAt -- in case a
future code path forgets to clear the inherited connected flag on lifecycle
start. Scoped to mode:polling so webhook channels and channels without
continuous transport tracking are not falsely flagged.

Also bump lastStartAt: Date.now() - 121_000 to 301_000 in the spool-handler
timeout test added by upstream openclaw#83505 so it falls past the widened 300s
TELEGRAM_POLLING_CONNECT_GRACE_MS suppression window (mirroring the same
fixup already applied to the two adjacent polling-startup tests).

* revert(telegram,gateway): keep connect grace at 120s

Drop the 120s -> 300s widening from this PR after maintainer feedback that
the extra grace masks real startup bugs. The defense-in-depth checks added
in earlier commits (notePollingStart clearing inherited connected state,
the stale-socket policy branch, the per-snapshot startup grace test) all
work fine at 120s and remain valuable on their own.

Reverts in:
- src/gateway/channel-health-policy.ts: DEFAULT_CHANNEL_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status-issues.ts: TELEGRAM_POLLING_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status.test.ts: lastStartAt 301_000 -> 121_000 (3 cases)

The new channel-health-policy.test.ts cases use explicit channelConnectGraceMs:
10_000 in the policy, so they are unaffected by the default constant change.

* fix(telegram): narrow polling keepalive fix

---------

Co-authored-by: Yibei Ou <yibeiou@Yibeis-Mac-mini.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
* fix(telegram): enable TCP keepalive on getUpdates connections to prevent NAT timeout stalls

Long-polling connections to api.telegram.org stay idle for up to the
getUpdates timeout (~900 s). Most home/office NAT tables expire idle TCP
entries after 60–1800 s (commonly ~1000 s). When the NAT entry is
silently dropped the connection hangs rather than returning an error,
leaving the grammY runner stuck until the 90 s stall watchdog fires and
forces a restart cycle.

Fix: unconditionally set `keepAlive: true` and
`keepAliveInitialDelay: 30_000` (30 s) on the undici Agent `connect`
options built in `buildTelegramConnectOptions`. OS-level TCP keepalive
probes sent every ~75 s (OS default) will:
1. Refresh the NAT table entry before it expires.
2. Surface dead connections immediately with ETIMEDOUT instead of
   hanging forever.

The `return Object.keys(connect).length > 0 ? connect : null` guard is
also removed; `connect` is now always non-empty so it always returns the
object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 92e454c)

* fix(telegram): stop self-flagging disconnected on poll-cycle start; widen channel connect grace to 300s

(cherry picked from commit 1ca963a)

* fix(telegram): catch hung polling startups that preserve inherited connected:true

The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.

Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.

Addresses clawsweeper review on openclaw#83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.

* fix(telegram): clear polling connected state on startup

* fix(gateway): add defense-in-depth health-policy branch for hung polling startups

Defense in depth on top of 87db46c's notePollingStart connected:false fix.
The primary path (notePollingStart writes connected:false explicitly so
evaluateChannelHealth's existing connected===false branch catches a hung
restart) is unchanged. This adds a defensive post-grace branch that catches
the same hang via a different signature -- inherited connected:true paired
with null lastConnectedAt and null lastTransportActivityAt -- in case a
future code path forgets to clear the inherited connected flag on lifecycle
start. Scoped to mode:polling so webhook channels and channels without
continuous transport tracking are not falsely flagged.

Also bump lastStartAt: Date.now() - 121_000 to 301_000 in the spool-handler
timeout test added by upstream openclaw#83505 so it falls past the widened 300s
TELEGRAM_POLLING_CONNECT_GRACE_MS suppression window (mirroring the same
fixup already applied to the two adjacent polling-startup tests).

* revert(telegram,gateway): keep connect grace at 120s

Drop the 120s -> 300s widening from this PR after maintainer feedback that
the extra grace masks real startup bugs. The defense-in-depth checks added
in earlier commits (notePollingStart clearing inherited connected state,
the stale-socket policy branch, the per-snapshot startup grace test) all
work fine at 120s and remain valuable on their own.

Reverts in:
- src/gateway/channel-health-policy.ts: DEFAULT_CHANNEL_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status-issues.ts: TELEGRAM_POLLING_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status.test.ts: lastStartAt 301_000 -> 121_000 (3 cases)

The new channel-health-policy.test.ts cases use explicit channelConnectGraceMs:
10_000 in the policy, so they are unaffected by the default constant change.

* fix(telegram): narrow polling keepalive fix

---------

Co-authored-by: Yibei Ou <yibeiou@Yibeis-Mac-mini.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: telegram Channel integration: telegram merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. P1 High-priority user-facing bug, regression, or broken workflow. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. size: XS status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants