
Slack long-running runs can appear silent when media delivery partially fails and recovery refuses replay #83165

@tianxiaochannel-oss88


Bug report draft: Slack long-running runs can appear silent after partial media delivery failure


Version / environment

  • OpenClaw: 2026.5.12 (f066dd2)
  • Runtime: macOS / LaunchAgent gateway
  • Channel: Slack Socket Mode
  • Model route: OpenAI Responses-compatible provider, gpt-5.5
  • Context window observed: 200k
  • Relevant config:
    • agents.defaults.contextInjection = "continuation-skip"
    • agents.defaults.compaction.mode = "safeguard"
    • agents.defaults.compaction.reserveTokensFloor = 40000
    • agents.defaults.compaction.timeoutSeconds = 240
    • agents.defaults.timeoutSeconds = 900
    • Slack status reactions enabled

Summary

Some Slack long-running sessions can look like they silently stop responding. The underlying work may have completed or partially completed, but final Slack delivery involving media can enter a send_attempt_started / partial delivery failure (bestEffort) state. After a gateway restart, delivery recovery refuses blind replay for safety, leaving no visible user-facing failure/recovery message in the Slack thread.

Separately, status/progress visibility can disappear during these failures, making it hard for the user to know whether the run is still active, aborted, or only failed during delivery.

Observed evidence

During one evening of usage, logs showed multiple reliability/observability symptoms:

  • 3 failed delivery queue entries from the same day with:
    • recoveryState: "send_attempt_started"
    • lastError: "partial delivery failure (bestEffort)"
    • payloads were Slack text plus local media attachment(s)
  • On gateway restart, recovery logged:
    • Found 3 pending delivery entries — starting recovery
    • delivery state is send_attempt_started; refusing blind replay without adapter reconciliation
    • Delivery recovery complete: 0 recovered, 3 failed, 0 skipped (max retries), 0 deferred (backoff)
  • Gateway restart happened while work was active:
    • draining 2 active task(s) and 1 active embedded run(s) before restart with timeout 300000ms
  • Additional logs around the same period included:
    • [responses] ... message=Request was aborted
    • fetch timeout reached; aborting operation
    • upstream internal_server_error / HTTP 502 from image/model providers
    • [timeout-compaction] compaction did not reduce context ... falling through to normal handling
    • [pi] discarded invalid tool result middleware output for message
    • Tool output unavailable due to post-processing error
    • long-running session ... queued_behind_active_work ... activeWorkKind=model_call ... recovery=none
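The recovery summary line quoted above can be turned into counts that a monitor or health check could alert on. The sketch below is a minimal illustration that assumes the log line keeps exactly the shape shown in this report; `parse_recovery_summary` and `needs_attention` are hypothetical helper names, not part of OpenClaw.

```python
import re

# Matches the summary line quoted in this report. The parenthesised notes
# ("max retries", "backoff") are treated as optional. The exact log format
# is an assumption taken from the single sample above.
SUMMARY_RE = re.compile(
    r"Delivery recovery complete:\s*"
    r"(?P<recovered>\d+) recovered,\s*"
    r"(?P<failed>\d+) failed,\s*"
    r"(?P<skipped>\d+) skipped(?: \([^)]*\))?,\s*"
    r"(?P<deferred>\d+) deferred(?: \([^)]*\))?"
)


def parse_recovery_summary(line):
    """Return the four counters as a dict, or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def needs_attention(summary):
    """True when at least one delivery finished recovery in the failed state."""
    return summary is not None and summary["failed"] > 0


line = (
    "Delivery recovery complete: 0 recovered, 3 failed, "
    "0 skipped (max retries), 0 deferred (backoff)"
)
summary = parse_recovery_summary(line)
print(summary, needs_attention(summary))
```

For the line from this report, the parsed `failed` count is 3, so `needs_attention` returns `True`. That is the condition under which a user-visible notice would be most useful.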

Actual behavior

From the Slack user perspective:

  • A long-running task may appear to stop responding.
  • Progress/status indicators may no longer show useful state.
  • If final delivery partially fails, the user may not see a final success, a final error, or a recovery notice.
  • After restart, recovery refuses blind replay, which is understandable, but the user is not clearly informed that delivery was left unresolved.

Expected behavior

OpenClaw should make these failure modes visible and recoverable without risking duplicate spam:

  1. If Slack delivery enters send_attempt_started and cannot be safely replayed, post or queue a small fallback notice such as:
    • “A previous reply may have partially failed during Slack delivery. It was not replayed automatically to avoid duplicates. Run a recovery command or inspect the delivery queue.”
  2. Provide adapter reconciliation for Slack deliveries where possible:
    • check whether a message/file was actually posted before refusing replay permanently;
    • if ambiguous, expose a clear manual recovery action.
  3. Keep status/progress visible for long-running sessions even if the final delivery path fails.
  4. Treat post-processing / invalid tool result middleware errors as observable diagnostic events, not silent status loss.
  5. For media delivery, consider a safer two-phase pattern:
    • send text/status first;
    • upload media second;
    • if media upload fails, leave the text reply visible with a retry/recovery hint.
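Points 1, 2, and 5 above could be sketched roughly as follows. This is an illustration under stated assumptions, not OpenClaw's actual recovery or delivery code: `plan_recovery`, `deliver_two_phase`, the `Action` values, and the `send_text` / `upload_media` callables are all hypothetical, and `already_posted` stands in for whatever result a Slack adapter reconciliation step might return.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional

FALLBACK_NOTICE = (
    "A previous reply may have partially failed during Slack delivery. "
    "It was not replayed automatically to avoid duplicates."
)


class Action(Enum):
    MARK_DELIVERED = "mark_delivered"  # reconciliation found the message in Slack
    REPLAY = "replay"                  # reconciliation showed nothing was posted
    NOTIFY_ONLY = "notify_only"        # ambiguous: no replay, post FALLBACK_NOTICE
    DEFER = "defer"                    # other states: leave to existing recovery


def plan_recovery(recovery_state: str, already_posted: Optional[bool]) -> Action:
    """Decide what to do with one pending delivery entry.

    already_posted is the hypothetical reconciliation result: True or False
    when the adapter could check, None when it could not tell.
    """
    if recovery_state != "send_attempt_started":
        return Action.DEFER
    if already_posted is True:
        return Action.MARK_DELIVERED
    if already_posted is False:
        return Action.REPLAY
    return Action.NOTIFY_ONLY


@dataclass
class DeliveryResult:
    text_sent: bool
    media_failed: List[str] = field(default_factory=list)


def deliver_two_phase(
    text: str,
    media: List[str],
    send_text: Callable[[str], None],
    upload_media: Callable[[str], None],
) -> DeliveryResult:
    """Send the text first, then each attachment; report failed uploads."""
    send_text(text)  # phase 1: an exception here propagates to the caller
    failed = []
    for item in media:  # phase 2: one upload failure does not stop the rest
        try:
            upload_media(item)
        except Exception:
            failed.append(item)
    if failed:
        send_text(
            f"{len(failed)} attachment(s) failed to upload. "
            "The text reply above is complete; retry the upload to recover them."
        )
    return DeliveryResult(text_sent=True, media_failed=failed)


# Usage with in-memory stand-ins for the Slack calls (placeholder file names).
posted: List[str] = []


def fake_upload(name: str) -> None:
    if name == "b.png":
        raise IOError("simulated upload failure")


result = deliver_two_phase("Done.", ["a.png", "b.png"], posted.append, fake_upload)
print(plan_recovery("send_attempt_started", None), result.media_failed, len(posted))
```

In this sketch the ambiguous case never triggers a replay, so it cannot produce duplicates, but it always results in a visible notice. A failed upload likewise leaves the text reply and a retry hint in the thread instead of an unresolved queue entry.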

Why this matters

The current behavior makes completed or partially completed runs indistinguishable from hung runs. This is especially painful for long-running image/video workflows where the final payload often includes media and the user depends on Slack progress/status to know whether to wait, retry, or inspect logs.

Privacy note

This report intentionally omits local paths, channel IDs, user IDs, tokens, and media names. Full local logs can be provided privately if needed.

Metadata


Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
