Telegram isolated ingress timeout recovery misses lone active spooled handler without backlog #84158

@crash2kx

Summary

With the #83505 fix present, I still observed a Telegram isolated-ingress .json.processing marker remaining stuck for a single active topic update when no later same-lane update was queued behind it.

This looks like a remaining edge case in the timeout recovery trigger, not a duplicate of the original #83272 failure mode.

Related

Environment

  • OpenClaw: 2026.5.18
  • Local commit: 50a2481652
  • Install type: Docker
  • Channel: Telegram supergroup forum topics
  • Runtime: Codex app-server / embedded agent
  • Gateway state at inspection time: running and healthy after the turn eventually cleared

Observed behavior

During Telegram topic testing, a topic message caused prolonged main-thread CPU pressure and delayed health/Telegram behavior. After the system settled, the ingress spool still contained a .json.processing file for the topic update.

Important detail: there was not necessarily a later same-lane update behind that processing marker. The update could therefore remain a lone active handler rather than appearing in drain.blockedByLane.

Source-level concern

The current recovery path appears to invoke timeout recovery with drain.blockedByLane as the candidate set. That catches the important case fixed by #83505, where a stuck handler blocks later same-lane updates.

But a single active stuck handler without a later same-lane update may not be included in blockedByLane, so #recoverTimedOutSpooledHandler(...) may not evaluate it for timeout recovery even after the handler timeout has elapsed.
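To make the concern concrete, here is a minimal self-contained model of the suspected gap. It is a sketch under assumptions, not the real drain logic: `drainPending`, `SpooledUpdate`, and the lane and handler key strings are hypothetical names, and the model assumes blockedByLane is populated only when a pending update finds an active handler on its lane.

```typescript
// Hypothetical model of the drain step; the real polling-session.ts may differ.
type SpooledUpdate = { updateId: number; laneKey: string };

interface DrainResult {
  // Handler keys whose lane has at least one pending update waiting behind them.
  blockedByLane: Set<string>;
}

// activeByLane maps a lane key to the handler key currently processing that lane.
function drainPending(
  pending: readonly SpooledUpdate[],
  activeByLane: ReadonlyMap<string, string>,
): DrainResult {
  const blockedByLane = new Set<string>();
  for (const update of pending) {
    const activeHandlerKey = activeByLane.get(update.laneKey);
    if (activeHandlerKey !== undefined) {
      // A handler is only recorded here because something is queued behind it.
      blockedByLane.add(activeHandlerKey);
    }
  }
  return { blockedByLane };
}

const stuckLane = new Map<string, string>([["topic-lane", "handler-A"]]);

// With a backlog, the stuck handler becomes a timeout candidate.
const withBacklog = drainPending([{ updateId: 2, laneKey: "topic-lane" }], stuckLane);
console.log(withBacklog.blockedByLane.has("handler-A")); // true

// Without a backlog, the same stuck handler never enters the candidate set.
const withoutBacklog = drainPending([], stuckLane);
console.log(withoutBacklog.blockedByLane.size); // 0
```

If the real candidate set is derived this way, the second case is the one observed here: the handler is active and past its timeout, but nothing nominates it for recovery.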

Suggested narrow fix

Build the timeout candidate set from all active spooled handlers for the same spool, then union in drain.blockedByLane for compatibility:

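// Start from every active spooled handler for this spool, so a lone stuck handler is considered.
const timeoutCandidateHandlerKeys = this.#activeSpooledUpdateHandlerKeysForSpool(spoolDir);
// Union in the lane-blocked handlers to keep the #83505 behavior.
for (const handlerKey of drain.blockedByLane) {
  timeoutCandidateHandlerKeys.add(handlerKey);
}
// Evaluate the combined candidate set for handler timeouts.
const timedOutRecovery = await this.#recoverTimedOutSpooledHandler(timeoutCandidateHandlerKeys);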

This preserves same-lane ordering and #83505's tombstone/restart behavior, but also lets a lone active processing claim time out.
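The candidate-set change can also be exercised in isolation. The following is a rough sketch, assuming active handlers are tracked with a spool directory and a start timestamp. `ActiveSpooledHandler`, `activeHandlerKeysForSpool`, `buildTimeoutCandidates`, and `selectTimedOut` are hypothetical stand-ins, not the actual private methods in polling-session.ts.

```typescript
// Hypothetical record of an in-flight spooled handler; field names are assumptions.
type ActiveSpooledHandler = {
  handlerKey: string;
  spoolDir: string;
  startedAtMs: number;
};

// Stand-in for the proposed #activeSpooledUpdateHandlerKeysForSpool helper.
function activeHandlerKeysForSpool(
  active: readonly ActiveSpooledHandler[],
  spoolDir: string,
): Set<string> {
  return new Set(
    active.filter((h) => h.spoolDir === spoolDir).map((h) => h.handlerKey),
  );
}

// Union of all active handlers for the spool and the lane-blocked handlers.
function buildTimeoutCandidates(
  active: readonly ActiveSpooledHandler[],
  spoolDir: string,
  blockedByLane: ReadonlySet<string>,
): Set<string> {
  const candidates = activeHandlerKeysForSpool(active, spoolDir);
  for (const handlerKey of blockedByLane) {
    candidates.add(handlerKey);
  }
  return candidates;
}

// Keys of candidates whose handler has been running for at least timeoutMs.
function selectTimedOut(
  active: readonly ActiveSpooledHandler[],
  candidates: ReadonlySet<string>,
  nowMs: number,
  timeoutMs: number,
): string[] {
  return active
    .filter((h) => candidates.has(h.handlerKey) && nowMs - h.startedAtMs >= timeoutMs)
    .map((h) => h.handlerKey);
}

// A lone handler with an empty blockedByLane set is still selected after the timeout.
const loneHandlers: ActiveSpooledHandler[] = [
  { handlerKey: "lone-topic-handler", spoolDir: "/spool/a", startedAtMs: 0 },
];
const loneCandidates = buildTimeoutCandidates(loneHandlers, "/spool/a", new Set<string>());
console.log(selectTimedOut(loneHandlers, loneCandidates, 60_000, 30_000));
```

Filtering by spool directory keeps handlers from other spools out of the candidate set, while the union leaves every handler that was already lane-blocked eligible exactly as before.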

Regression coverage idea

Add a polling-session test where:

  1. A single spooled topic update is claimed and handleUpdate never settles.
  2. No later same-lane update exists.
  3. spooledUpdateHandlerTimeoutMs elapses.
  4. The update is failed into a tombstone and isolated ingress restart is requested.
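The four steps above can be sketched without the real session or a test framework. This is only an illustration of the test shape: `FakeSession`, its methods, and the injected clock are hypothetical, and a real test would drive the actual polling session with the project's own test utilities and fake timers.

```typescript
// Hypothetical stand-in for the polling session; not the real class.
class FakeSession {
  private readonly timeoutMs: number;
  private readonly now: () => number;
  // updateId -> time the processing claim was taken
  private readonly active = new Map<number, number>();
  readonly tombstoned: number[] = [];
  restartRequested = false;

  constructor(timeoutMs: number, now: () => number) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Step 1: claim the update and start a handler that may never settle.
  claim(updateId: number, handleUpdate: () => Promise<void>): void {
    this.active.set(updateId, this.now());
    const release = () => {
      this.active.delete(updateId);
    };
    void handleUpdate().then(release, release);
  }

  // Steps 3 and 4: every active claim is a timeout candidate, backlog or not.
  recoverTimedOut(): void {
    for (const [updateId, claimedAtMs] of this.active) {
      if (this.now() - claimedAtMs >= this.timeoutMs) {
        this.active.delete(updateId);
        this.tombstoned.push(updateId);
        this.restartRequested = true;
      }
    }
  }
}

let fakeNowMs = 0;
const session = new FakeSession(1_000, () => fakeNowMs);

// Steps 1 and 2: one claimed update whose handler never settles, and no backlog.
session.claim(42, () => new Promise<void>(() => {}));

session.recoverTimedOut(); // before the timeout: nothing should happen
console.log(session.tombstoned.length, session.restartRequested); // 0 false

fakeNowMs = 1_000; // step 3: the handler timeout elapses
session.recoverTimedOut();
console.log(session.tombstoned, session.restartRequested); // [ 42 ] true
```

Injecting the clock keeps the sketch deterministic; the never-settling promise mirrors the handleUpdate call that never returns.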

I prepared a small local patch sketch against extensions/telegram/src/polling-session.ts and extensions/telegram/src/polling-session.test.ts; git diff --check passes. I have not deployed that patch to the running gateway.

Why this matters

Without this edge-case recovery, a lone stuck .json.processing marker can make the account appear mostly recovered while leaving stale spool state behind. On small VPS installs this also correlates with user-visible Telegram delays and event-loop/CPU pressure during the stuck turn.
