
fix(telegram): recover sticky fetch fallback after transient failures#77157

Merged
obviyus merged 1 commit into openclaw:main from MkDev11:fix/issue-77088-telegram-sticky-recovery
May 8, 2026

Conversation

MkDev11 (Contributor) commented May 4, 2026

Summary

  • Problem: Telegram fetch sticky fallback only promoted from primary to IPv4/pinned-IP transports and never returned to primary.
  • Why it matters: transient Telegram egress failures could leave a gateway on degraded transport until restart.
  • What changed: after repeated successful sticky fallback requests, Telegram fetch performs one primary recovery probe and resets/demotes sticky state only on successful transport recovery.
  • What did NOT change (scope boundary): no config knobs, fallback IP changes, dispatcher pool changes, or Telegram send/polling API changes.

Change Type

  • [x] Bug fix
  • [ ] Feature
  • [ ] Refactor required for the fix
  • [ ] Docs
  • [ ] Security hardening
  • [ ] Chore/infra

Scope

  • [ ] Gateway / orchestration
  • [ ] Skills / tool execution
  • [ ] Auth / tokens
  • [ ] Memory / storage
  • [x] Integrations
  • [ ] API / contracts
  • [ ] UI / DX
  • [ ] CI/CD / infra

Linked Issue/PR

Root Cause

  • Root cause: stickyAttemptIndex was monotonic and had no success-path recovery logic.
  • Missing detection / guardrail: existing tests asserted sticky promotion but not recovery after transient failure.
  • Contributing context (if known): Telegram transport fallback is useful for persistent IPv6/DNS issues, but needed a bounded recovery path.
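
The monotonic behavior named as the root cause can be illustrated as follows (only the variable name `stickyAttemptIndex` comes from the PR; everything else is an assumed sketch of the pre-fix logic, not the real code):

```typescript
// Illustrative "before" sketch: promotion only, no success-path recovery.
let stickyAttemptIndex = 0; // 0 = primary; higher values = fallback dispatchers

function onTransportFailure(dispatcherCount: number): void {
  // The index only ratchets upward, clamped at the last fallback dispatcher.
  stickyAttemptIndex = Math.min(stickyAttemptIndex + 1, dispatcherCount - 1);
}

function onTransportSuccess(): void {
  // Bug: no success-path logic existed, so the index never returned to 0
  // and every later request started from the promoted (degraded) dispatcher.
}
```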

Regression Test Plan

  • Coverage level that should have caught this:
    • [x] Unit test
    • [ ] Seam / integration test
    • [ ] End-to-end test
    • [ ] Existing coverage already sufficient
  • Target test or file: extensions/telegram/src/fetch.test.ts
  • Scenario the test should lock in: sticky IPv4 and pinned-IP fallback recover to primary after repeated successes, while a failed primary probe keeps fallback sticky.
  • Why this is the smallest reliable guardrail: it tests the transport state machine directly with mocked fetch dispatchers.
  • Existing test that already covers this (if any): existing sticky fallback tests covered promotion only.
  • If no new test is added, why not: N/A
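
The scenario above can be locked in along these lines (a plain-assert sketch; the real test in extensions/telegram/src/fetch.test.ts uses Vitest with mocked fetch dispatchers, and the `simulate` helper and its threshold are hypothetical):

```typescript
// Plain-assert sketch of the guardrail scenario; all names are hypothetical.
type Outcome = "ok" | "fail";

// Simulate a request sequence against a two-dispatcher chain
// (primary, fallback) and return the final sticky index.
function simulate(outcomes: { primary: Outcome; fallback: Outcome }[]): number {
  const THRESHOLD = 3;
  let sticky = 0;    // 0 = primary, 1 = fallback
  let successes = 0;
  for (const o of outcomes) {
    const probing = sticky === 1 && successes >= THRESHOLD;
    if (sticky === 0 || probing) {
      if (o.primary === "ok") {
        sticky = 0;      // primary healthy (or recovered via probe)
        successes = 0;
        continue;
      }
      if (probing) successes = 0; // failed probe: restart the count
    }
    if (o.fallback === "ok") {
      sticky = 1;        // request served via fallback
      successes += 1;
    } else {
      successes = 0;     // total failure: fallback stays armed
    }
  }
  return sticky;
}

// Recovery: transient primary failures, enough fallback successes, probe succeeds.
const recovered = simulate([
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "ok",   fallback: "ok" }, // recovery probe hits healthy primary
]);
if (recovered !== 0) throw new Error("expected reset to primary");

// A failed recovery probe keeps the fallback sticky.
const stillSticky = simulate([
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" }, // probe fails, same request falls back
]);
if (stillSticky !== 1) throw new Error("expected fallback to remain sticky");
```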

User-visible / Behavior Changes

Telegram transport can recover from sticky IPv4/pinned-IP fallback without restarting the gateway after the primary path becomes healthy again.

Diagram

Before:
primary failure -> sticky fallback -> remains degraded until restart

After:
primary failure -> sticky fallback -> recovery probe -> primary restored when healthy

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 24.13.0
  • Model/provider: N/A
  • Integration/channel (if any): Telegram
  • Relevant config (redacted): default Telegram fetch transport with fallback enabled

Steps

  1. Simulate a transient primary Telegram fetch failure.
  2. Observe sticky fallback promotion to IPv4 or pinned-IP dispatcher.
  3. Let fallback requests succeed enough to trigger recovery probing.
  4. Simulate primary recovery.

Expected

  • Sticky fallback resets to primary after a successful primary recovery probe.

Actual

  • Before this fix, sticky fallback remained degraded until process restart.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)
pnpm test extensions/telegram/src/fetch.test.ts
pnpm test extensions/telegram/src/polling-transport-state.test.ts extensions/telegram/src/polling-session.test.ts
pnpm test extensions/telegram/src/send.test.ts
pnpm check:changed

Human Verification

What you personally verified (not just CI), and how:

  • Verified scenarios: IPv4 sticky fallback recovery, pinned-IP sticky fallback recovery, failed primary recovery probe retaining fallback.
  • Edge cases checked: caller-provided dispatchers do not advance recovery state; all-attempt failure still leaves armed fallback sticky.
  • What you did not verify: live overnight flaky-network saturation behind GFW.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Primary transport could still be unhealthy when probed.
    • Mitigation: the probe is bounded to one normal request after repeated sticky successes; if primary fails, the same request falls back and sticky fallback remains active.
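
The mitigation works because the probe is not a separate request: it is an ordinary request that merely starts at the primary dispatcher, so a failed probe walks the same fallback chain and the caller sees no extra failure. A hedged sketch, assuming a simplified synchronous `Dispatcher` stand-in for the real undici dispatchers (`requestWithProbe` and both field names are hypothetical):

```typescript
// Hypothetical sketch of the bounded recovery probe.
type Dispatcher = (url: string) => string; // throws on transport failure

function requestWithProbe(
  url: string,
  chain: Dispatcher[],
  state: { sticky: number; probe: boolean },
): string {
  // A probe starts at index 0 (primary); otherwise start at the sticky index.
  const start = state.probe ? 0 : state.sticky;
  for (let i = start; i < chain.length; i++) {
    try {
      const body = chain[i](url);
      state.sticky = i; // a successful primary probe lands here with i === 0
      return body;
    } catch {
      // this dispatcher failed; try the next one in the chain
    }
  }
  throw new Error("all dispatchers failed");
}
```

If primary is still unhealthy when probed, the loop simply continues to the fallback dispatcher that was already serving traffic, which is why the probe cannot make an outage worse.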

Real behavior proof

  • Behavior addressed: Telegram fetch sticky fallback now recovers from IPv4/pinned-IP fallback to the primary transport after repeated successful fallback requests, instead of staying degraded until gateway restart.
  • Real environment tested: Local OpenClaw checkout on Linux with Node 24.13.0, exercising the Telegram fetch transport state machine after this patch. No live Telegram bot token was used because the reported outage requires a flaky GFW-style egress path, but the runtime transport behavior is covered directly.
  • Exact steps or command run after this patch:
pnpm test extensions/telegram/src/fetch.test.ts
  • Evidence after fix:
extensions/telegram/src/fetch.test.ts passed locally on Node 24.13.0.
The after-fix output covered the sticky IPv4 fallback recovery path, the sticky pinned-IP fallback recovery path, and the failed-primary-probe path that keeps fallback sticky.

openclaw-barnacle (bot) added the labels channel: telegram (Channel integration: telegram) and size: S on May 4, 2026
clawsweeper bot (Contributor) commented May 4, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR adds Telegram sticky fallback success tracking, a bounded primary recovery probe/demotion path, focused regression tests, and a changelog entry.

Reproducibility: yes, at the source level: current main only promotes stickyAttemptIndex, starts later requests from the promoted dispatcher, and existing tests assert fallback reuse. I did not establish a live overnight flaky-egress reproduction path.

Real behavior proof
Needs real behavior proof before merge: The PR body supplies local mocked Vitest proof only; it still needs redacted non-mock terminal output, logs, screenshot, recording, or linked artifact from a real after-fix OpenClaw/Telegram runtime path. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Contributor proof is the remaining blocker; an automated repair cannot provide non-mock evidence from the contributor's runtime setup.

Security
Cleared: The diff only changes Telegram fetch state tracking, tests, and changelog text; it does not change dependencies, workflows, permissions, secrets handling, package resolution, or code execution paths.

Review details

Best possible solution:

Land the focused Telegram transport recovery after redacted real-runtime proof is added, leaving pool-size, fallback-IP, and config-knob tuning as separate decisions.

Do we have a high-confidence way to reproduce the issue?

Yes at source level: current main only promotes stickyAttemptIndex, starts later requests from the promoted dispatcher, and existing tests assert fallback reuse. I did not establish a live overnight flaky-egress reproduction path.

Is this the best way to solve the issue?

Yes for the code direction: a bounded primary recovery probe inside the Telegram plugin transport is narrower than adding config knobs or changing fallback IP policy. The PR is not merge-ready until non-mock after-fix runtime proof is supplied.

Acceptance criteria:

  • Contributor-supplied redacted real behavior proof from an after-fix OpenClaw/Telegram runtime path
  • pnpm test extensions/telegram/src/fetch.test.ts
  • pnpm test extensions/telegram/src/polling-transport-state.test.ts extensions/telegram/src/polling-session.test.ts
  • pnpm test extensions/telegram/src/send.test.ts
  • pnpm check:changed

What I checked:

Likely related people:

  • obviyus: Authored the merged change that unified Telegram API and media fetches under the sticky IPv4 and pinned-IP fallback chain that this PR updates. (role: introduced affected fallback chain; confidence: high; commits: e4825a0f9385; files: extensions/telegram/src/fetch.ts, extensions/telegram/src/fetch.test.ts)
  • steipete: Recent commits around Telegram startup fallback retries, managed proxy handling, fallback log levels, and Telegram docs touch the same transport surface. (role: recent maintainer and likely follow-up owner; confidence: high; commits: 74a667f119cf, dc9f1b8525b1, e873c1e1f815; files: extensions/telegram/src/fetch.ts, extensions/telegram/src/fetch.test.ts, docs/channels/telegram.md)

Remaining risk / open question:

  • The contributor-supplied proof is still mocked Vitest output, not a redacted live or real-runtime Telegram/OpenClaw run.
  • The full event-loop saturation symptom depends on flaky Telegram egress and was not live-reproduced, though the source-level sticky fallback root cause is clear.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e983aa57b8b.

MkDev11 (Contributor, Author) commented May 4, 2026

@clawsweeper

openclaw-barnacle (bot) added the labels triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup) and proof: supplied (External PR includes structured after-fix real behavior proof), then removed triage: needs-real-behavior-proof, on May 6, 2026
obviyus force-pushed the fix/issue-77088-telegram-sticky-recovery branch 2 times, most recently from bf029b9 to e78ef27 (May 8, 2026 03:33)
obviyus force-pushed the fix/issue-77088-telegram-sticky-recovery branch from e78ef27 to 1bf2801 (May 8, 2026 03:39)
obviyus merged commit 252456e into openclaw:main on May 8, 2026
104 checks passed
obviyus (Contributor) commented May 8, 2026

Landed via rebase onto main.

  • Scoped tests: /Users/obviyus/Developer/openclaw/node_modules/.bin/oxfmt --check --threads=1 CHANGELOG.md extensions/telegram/src/fetch.ts extensions/telegram/src/fetch.test.ts; OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/tmp/openclaw-vitest-pr77157-land-fetch2 /Users/obviyus/Developer/openclaw/node_modules/.bin/vitest run extensions/telegram/src/fetch.test.ts
  • Changelog: CHANGELOG.md updated
  • Land commit: 1bf2801
  • Merge commit: 252456e

Thanks @MkDev11!


Labels

  • channel: telegram — Channel integration: telegram
  • proof: supplied — External PR includes structured after-fix real behavior proof.
  • size: S


Development

Successfully merging this pull request may close these issues.

[Bug/Design]: Telegram fetch stickyAttemptIndex is monotonic — gateway never recovers from transient network failures without restart

2 participants