
fix(telegram): recover sticky fetch fallback after transient failures#77157

Merged
obviyus merged 1 commit into openclaw:main from MkDev11:fix/issue-77088-telegram-sticky-recovery
May 8, 2026

Conversation

MkDev11 (Contributor) commented May 4, 2026

Summary

  • Problem: Telegram fetch sticky fallback only promoted from primary to IPv4/pinned-IP transports and never returned to primary.
  • Why it matters: transient Telegram egress failures could leave a gateway on degraded transport until restart.
  • What changed: after repeated successful sticky fallback requests, Telegram fetch performs one primary recovery probe and resets/demotes sticky state only on successful transport recovery.
  • What did NOT change (scope boundary): no config knobs, fallback IP changes, dispatcher pool changes, or Telegram send/polling API changes.

Change Type

  • [x] Bug fix
  • [ ] Feature
  • [ ] Refactor required for the fix
  • [ ] Docs
  • [ ] Security hardening
  • [ ] Chore/infra

Scope

  • [ ] Gateway / orchestration
  • [ ] Skills / tool execution
  • [ ] Auth / tokens
  • [ ] Memory / storage
  • [x] Integrations
  • [ ] API / contracts
  • [ ] UI / DX
  • [ ] CI/CD / infra

Linked Issue/PR

Root Cause

  • Root cause: stickyAttemptIndex was monotonic and had no success-path recovery logic.
  • Missing detection / guardrail: existing tests asserted sticky promotion but not recovery after transient failure.
  • Contributing context (if known): Telegram transport fallback is useful for persistent IPv6/DNS issues, but needed a bounded recovery path.
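
The monotonic behavior named as the root cause can be illustrated as follows (only the variable name `stickyAttemptIndex` comes from the PR; everything else is an assumed sketch of the pre-fix logic, not the real code):

```typescript
// Illustrative "before" sketch: promotion only, no success-path recovery.
let stickyAttemptIndex = 0; // 0 = primary; higher values = fallback dispatchers

function onTransportFailure(dispatcherCount: number): void {
  // The index only ratchets upward, clamped at the last fallback dispatcher.
  stickyAttemptIndex = Math.min(stickyAttemptIndex + 1, dispatcherCount - 1);
}

function onTransportSuccess(): void {
  // Bug: no success-path logic existed, so the index never returned to 0
  // and every later request started from the promoted (degraded) dispatcher.
}
```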

Regression Test Plan

  • Coverage level that should have caught this:
    • [x] Unit test
    • [ ] Seam / integration test
    • [ ] End-to-end test
    • [ ] Existing coverage already sufficient
  • Target test or file: extensions/telegram/src/fetch.test.ts
  • Scenario the test should lock in: sticky IPv4 and pinned-IP fallback recover to primary after repeated successes, while a failed primary probe keeps fallback sticky.
  • Why this is the smallest reliable guardrail: it tests the transport state machine directly with mocked fetch dispatchers.
  • Existing test that already covers this (if any): existing sticky fallback tests covered promotion only.
  • If no new test is added, why not: N/A
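
The scenario above can be locked in along these lines (a plain-assert sketch; the real test in extensions/telegram/src/fetch.test.ts uses Vitest with mocked fetch dispatchers, and the `simulate` helper and its threshold are hypothetical):

```typescript
// Plain-assert sketch of the guardrail scenario; all names are hypothetical.
type Outcome = "ok" | "fail";

// Simulate a request sequence against a two-dispatcher chain
// (primary, fallback) and return the final sticky index.
function simulate(outcomes: { primary: Outcome; fallback: Outcome }[]): number {
  const THRESHOLD = 3;
  let sticky = 0;    // 0 = primary, 1 = fallback
  let successes = 0;
  for (const o of outcomes) {
    const probing = sticky === 1 && successes >= THRESHOLD;
    if (sticky === 0 || probing) {
      if (o.primary === "ok") {
        sticky = 0;      // primary healthy (or recovered via probe)
        successes = 0;
        continue;
      }
      if (probing) successes = 0; // failed probe: restart the count
    }
    if (o.fallback === "ok") {
      sticky = 1;        // request served via fallback
      successes += 1;
    } else {
      successes = 0;     // total failure: fallback stays armed
    }
  }
  return sticky;
}

// Recovery: transient primary failures, enough fallback successes, probe succeeds.
const recovered = simulate([
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "ok",   fallback: "ok" }, // recovery probe hits healthy primary
]);
if (recovered !== 0) throw new Error("expected reset to primary");

// A failed recovery probe keeps the fallback sticky.
const stillSticky = simulate([
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" },
  { primary: "fail", fallback: "ok" }, // probe fails, same request falls back
]);
if (stillSticky !== 1) throw new Error("expected fallback to remain sticky");
```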

User-visible / Behavior Changes

Telegram transport can recover from sticky IPv4/pinned-IP fallback without restarting the gateway after the primary path becomes healthy again.

Diagram

Before:
primary failure -> sticky fallback -> remains degraded until restart

After:
primary failure -> sticky fallback -> recovery probe -> primary restored when healthy

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 24.13.0
  • Model/provider: N/A
  • Integration/channel (if any): Telegram
  • Relevant config (redacted): default Telegram fetch transport with fallback enabled

Steps

  1. Simulate a transient primary Telegram fetch failure.
  2. Observe sticky fallback promotion to IPv4 or pinned-IP dispatcher.
  3. Let fallback requests succeed enough to trigger recovery probing.
  4. Simulate primary recovery.

Expected

  • Sticky fallback resets to primary after a successful primary recovery probe.

Actual

  • Before this fix, sticky fallback remained degraded until process restart.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)
pnpm test extensions/telegram/src/fetch.test.ts
pnpm test extensions/telegram/src/polling-transport-state.test.ts extensions/telegram/src/polling-session.test.ts
pnpm test extensions/telegram/src/send.test.ts
pnpm check:changed

Human Verification

What you personally verified (not just CI), and how:

  • Verified scenarios: IPv4 sticky fallback recovery, pinned-IP sticky fallback recovery, failed primary recovery probe retaining fallback.
  • Edge cases checked: caller-provided dispatchers do not advance recovery state; all-attempt failure still leaves armed fallback sticky.
  • What you did not verify: live overnight flaky-network saturation behind GFW.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Primary transport could still be unhealthy when probed.
    • Mitigation: the probe is bounded to one normal request after repeated sticky successes; if primary fails, the same request falls back and sticky fallback remains active.
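
The mitigation works because the probe is not a separate request: it is an ordinary request that merely starts at the primary dispatcher, so a failed probe walks the same fallback chain and the caller sees no extra failure. A hedged sketch, assuming a simplified synchronous `Dispatcher` stand-in for the real undici dispatchers (`requestWithProbe` and both field names are hypothetical):

```typescript
// Hypothetical sketch of the bounded recovery probe.
type Dispatcher = (url: string) => string; // throws on transport failure

function requestWithProbe(
  url: string,
  chain: Dispatcher[],
  state: { sticky: number; probe: boolean },
): string {
  // A probe starts at index 0 (primary); otherwise start at the sticky index.
  const start = state.probe ? 0 : state.sticky;
  for (let i = start; i < chain.length; i++) {
    try {
      const body = chain[i](url);
      state.sticky = i; // a successful primary probe lands here with i === 0
      return body;
    } catch {
      // this dispatcher failed; try the next one in the chain
    }
  }
  throw new Error("all dispatchers failed");
}
```

If primary is still unhealthy when probed, the loop simply continues to the fallback dispatcher that was already serving traffic, which is why the probe cannot make an outage worse.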

Real behavior proof

  • Behavior addressed: Telegram fetch sticky fallback now recovers from IPv4/pinned-IP fallback to the primary transport after repeated successful fallback requests, instead of staying degraded until gateway restart.
  • Real environment tested: Local OpenClaw checkout on Linux with Node 24.13.0, exercising the Telegram fetch transport state machine after this patch. No live Telegram bot token was used because the reported outage requires a flaky GFW-style egress path, but the runtime transport behavior is covered directly.
  • Exact steps or command run after this patch:
pnpm test extensions/telegram/src/fetch.test.ts
  • Evidence after fix:
extensions/telegram/src/fetch.test.ts passed locally on Node 24.13.0.
The after-fix output covered the sticky IPv4 fallback recovery path, the sticky pinned-IP fallback recovery path, and the failed-primary-probe path that keeps fallback sticky.

openclaw-barnacle (bot) added the labels channel: telegram (Channel integration: telegram) and size: S on May 4, 2026
clawsweeper bot (Contributor) commented May 4, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR adds Telegram sticky fallback success tracking, a bounded primary recovery probe/demotion path, focused regression tests, and a changelog entry.

Reproducibility: yes, at the source level: current main only promotes stickyAttemptIndex, starts later requests from the promoted dispatcher, and existing tests assert fallback reuse. I did not establish a live overnight flaky-egress reproduction path.

Real behavior proof
Needs real behavior proof before merge: The PR body supplies local mocked Vitest proof only; it still needs redacted non-mock terminal output, logs, screenshot, recording, or linked artifact from a real after-fix OpenClaw/Telegram runtime path. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Contributor proof is the remaining blocker; an automated repair cannot provide non-mock evidence from the contributor's runtime setup.

Security
Cleared: The diff only changes Telegram fetch state tracking, tests, and changelog text; it does not change dependencies, workflows, permissions, secrets handling, package resolution, or code execution paths.

Review details

Best possible solution:

Land the focused Telegram transport recovery after redacted real-runtime proof is added, leaving pool-size, fallback-IP, and config-knob tuning as separate decisions.

Do we have a high-confidence way to reproduce the issue?

Yes at source level: current main only promotes stickyAttemptIndex, starts later requests from the promoted dispatcher, and existing tests assert fallback reuse. I did not establish a live overnight flaky-egress reproduction path.

Is this the best way to solve the issue?

Yes for the code direction: a bounded primary recovery probe inside the Telegram plugin transport is narrower than adding config knobs or changing fallback IP policy. The PR is not merge-ready until non-mock after-fix runtime proof is supplied.

Acceptance criteria:

  • Contributor-supplied redacted real behavior proof from an after-fix OpenClaw/Telegram runtime path
  • pnpm test extensions/telegram/src/fetch.test.ts
  • pnpm test extensions/telegram/src/polling-transport-state.test.ts extensions/telegram/src/polling-session.test.ts
  • pnpm test extensions/telegram/src/send.test.ts
  • pnpm check:changed

What I checked:

Likely related people:

  • obviyus: Authored the merged change that unified Telegram API and media fetches under the sticky IPv4 and pinned-IP fallback chain that this PR updates. (role: introduced affected fallback chain; confidence: high; commits: e4825a0f9385; files: extensions/telegram/src/fetch.ts, extensions/telegram/src/fetch.test.ts)
  • steipete: Recent commits around Telegram startup fallback retries, managed proxy handling, fallback log levels, and Telegram docs touch the same transport surface. (role: recent maintainer and likely follow-up owner; confidence: high; commits: 74a667f119cf, dc9f1b8525b1, e873c1e1f815; files: extensions/telegram/src/fetch.ts, extensions/telegram/src/fetch.test.ts, docs/channels/telegram.md)

Remaining risk / open question:

  • The contributor-supplied proof is still mocked Vitest output, not a redacted live or real-runtime Telegram/OpenClaw run.
  • The full event-loop saturation symptom depends on flaky Telegram egress and was not live-reproduced, though the source-level sticky fallback root cause is clear.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e983aa57b8b.

MkDev11 (Contributor, Author) commented May 4, 2026

@clawsweeper

openclaw-barnacle (bot) added the labels triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup) and proof: supplied (External PR includes structured after-fix real behavior proof), then removed triage: needs-real-behavior-proof, on May 6, 2026
obviyus force-pushed the fix/issue-77088-telegram-sticky-recovery branch 2 times, most recently from bf029b9 to e78ef27 (May 8, 2026 03:33)
obviyus force-pushed the fix/issue-77088-telegram-sticky-recovery branch from e78ef27 to 1bf2801 (May 8, 2026 03:39)
obviyus merged commit 252456e into openclaw:main on May 8, 2026
104 checks passed
obviyus (Contributor) commented May 8, 2026

Landed via rebase onto main.

  • Scoped tests: /Users/obviyus/Developer/openclaw/node_modules/.bin/oxfmt --check --threads=1 CHANGELOG.md extensions/telegram/src/fetch.ts extensions/telegram/src/fetch.test.ts; OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/tmp/openclaw-vitest-pr77157-land-fetch2 /Users/obviyus/Developer/openclaw/node_modules/.bin/vitest run extensions/telegram/src/fetch.test.ts
  • Changelog: CHANGELOG.md updated
  • Land commit: 1bf2801
  • Merge commit: 252456e

Thanks @MkDev11!


Labels

  • channel: telegram — Channel integration: telegram
  • proof: supplied — External PR includes structured after-fix real behavior proof.
  • size: S


Development

Successfully merging this pull request may close these issues.

[Bug/Design]: Telegram fetch stickyAttemptIndex is monotonic — gateway never recovers from transient network failures without restart

2 participants