Skip to content

[Bug]: GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions (2026.5.27) #87692

@RyanSandoval

Description

@RyanSandoval

Bug type

Crash (process/app hangs)

Beta release blocker

No

Summary

In v2026.5.27, isolated agentTurn cron runs using github-copilot/claude-sonnet-4.6 (and likely any github-copilot model on the anthropic-messages transport) silently hang for ~365s with zero progress events, then abort internally with promptError: "This operation was aborted". The cron run is then misclassified as status: ok in the cron history (despite the trajectory's finalStatus: error), suppressing all failure alerts — the cron appears healthy while delivering nothing.

This is the github-copilot variant of the watchdog-hang pattern fixed for claude-cli in #86895 / PR #87546. The fix in #87546 only added stream-json progress tracking for the claude-cli backend; the anthropic-messages transport used by github-copilot's claude models has no equivalent progress signal, so the gateway watchdog cannot distinguish a working-but-silent stream from a wedged one.

Steps to reproduce

  1. Configure an isolated agentTurn cron with model: "github-copilot/claude-sonnet-4.6", no fallbacks, timeoutSeconds: 600, simple bash-then-message-tool prompt.
  2. Trigger the cron (manual or scheduled).
  3. Observe: trajectory file freezes at prompt.submitted for ~365s, then model.completed fires with aborted: true. No tool calls are made. Discord delivery never happens.
  4. Cron history records status: ok, delivered: false, deliveryStatus: not-requested.

Reproduces deterministically across all my morning runs (5/5 attempts at 5:00, 5:15, 5:30, 5:45, 6:00 AM PT today, plus an additional manual run at 6:28 AM PT, all hung identically with durations 366,400–372,640 ms).

Expected behavior

Either:

  • The github-copilot anthropic-messages transport emits stream/heartbeat progress so the gateway's stuck-session watchdog can distinguish active streams from hung ones, OR
  • A short transport-level idle timeout (e.g., 30–60s with no first token) triggers a transport error that the failover chain can catch, OR
  • The cron runner correctly records status: error when the trajectory's finalStatus is error, so existing failure-alert plumbing fires.

Actual behavior

Run hangs for ~6 min with no progress; aborts internally; cron records "success"; nothing delivered; no alerts.

Trajectory artifact (trace.artifacts.data)

{
  "finalStatus": "error",
  "aborted": true,
  "externalAbort": false,
  "timedOut": false,
  "idleTimedOut": false,
  "timedOutDuringCompaction": false,
  "timedOutDuringToolExecution": false,
  "promptError": "This operation was aborted",
  "promptErrorSource": "prompt",
  "compactionCount": 0,
  "assistantTexts": [],
  "itemLifecycle": { "startedCount": 0, "completedCount": 0, "activeCount": 0 },
  "toolMetas": [],
  "didSendViaMessagingTool": false,
  "messagingToolSentTexts": [],
  "messagingToolSentTargets": []
}

Cron run history (same run, recorded by cron)

{
  "action": "finished",
  "status": "ok",
  "runAtMs": 1779973200025,
  "durationMs": 372636,
  "provider": "github-copilot",
  "model": "claude-sonnet-4.6",
  "deliveryStatus": "not-requested",
  "delivery": { "delivered": false, "fallbackUsed": false }
}

Note the contradiction: status: ok and delivered: false on the cron-side, vs finalStatus: error and aborted: true on the trajectory-side.

Environment

  • OpenClaw: 2026.5.27
  • Gateway runtime: Node v25.9.0 invoking NVM-installed openclaw at ~/.nvm/versions/node/v26.2.0/lib/node_modules/openclaw/dist/index.js
  • OS: macOS 26.5 (Darwin 25.5.0 arm64)
  • Provider: github-copilot (token exchanged via proxy.business.githubcopilot.com, enterprise SKU copilot_for_business_seat_quota, freshly refreshed mid-failure)
  • Model: github-copilot/claude-sonnet-4.6 via anthropic-messages API
  • Session target: isolated
  • Affected cron payload: simple bash + message(action=send) + message(action=upload-file)
  • Auth is healthy throughout (token refresh succeeded; gh auth status reports valid token with all required scopes)

Related issues / context

Mitigation in use

The straightforward fix — adding ["github-copilot/gpt-5.5", "github-copilot/claude-opus-4.6"] as payload.fallbacksdoes not work. I confirmed this with a controlled retry: cron history shows fallbackUsed: false, delivered: false, status: ok after a fresh 374-second hang. The fallback chain is loaded but never triggers.

The reason is that pi-ai's failover classifier keys on specific patterns (the documented "An unknown error occurred" stream-wrapper text, HTTP 429, transport errors, etc.). The internal AbortError: "This operation was aborted" from the gateway's stuck-session watchdog does not match any of those patterns, so failover is silently inert for this exact failure mode.

Working mitigation: change the primary to a model on a different transport (github-copilot/gpt-5.5 via openai-responses). That bypasses the broken anthropic-messages code path entirely. Configured fallbacks back to sonnet/opus are still inert if gpt-5.5 ever hits the same class of issue.

Three distinct bugs in this one symptom

  1. Primary bug: github-copilot + anthropic-messages transport opens a stream, never emits a chunk, and is aborted by the internal watchdog ~365s later. (Sibling of Webchat: substantive tool turns intermittently hang in post-tool-result generation → ~365s abort_embedded_run while the host event loop stays healthy, and the turn is silently lost — v2026.5.22 (a374c3a) #86895, fix in Fix Claude live tool progress for watchdog recovery #87546 only covered claude-cli.)
  2. Failover-classifier gap: internal AbortError from the stuck-session watchdog is not on pi-ai's failover-worthy error list. Configured fallback chains do not fire for this failure mode. The classifier should probably treat aborted: true + assistantTexts: [] + toolMetas: [] as a failover-worthy outcome regardless of the specific error string.
  3. Cron-runner status misclassification: when the underlying trajectory ends with finalStatus: error and aborted: true, the cron runner still records status: ok in the run history. This means failureAlert blocks don't fire, consecutiveErrors stays at zero, the daily Cron Health Monitor reports the cron as healthy, and the user has no signal that anything is wrong — the cron silently dies for days while reporting green. Suggest the cron runner inspect trace.artifacts.data.finalStatus (or equivalently model.completed.data.aborted combined with empty assistantTexts) to set the correct status, and report deliveryStatus: failed when didSendViaMessagingTool: false on a payload whose prompt requires delivery.

Bugs 2 and 3 are likely worth separate issues, but I've kept them together here because they compose into the user-facing symptom: a cron that hangs, never delivers, never falls back, and never alerts.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High-priority user-facing bug, regression, or broken workflow.clawsweeper:needs-live-reproClawSweeper needs live local, crabbox, or manual validation to confirm this issue.clawsweeper:needs-maintainer-reviewClawSweeper marked this issue as needing maintainer review before automation.clawsweeper:needs-product-decisionClawSweeper marked this issue as needing a product or behavior decision.clawsweeper:no-new-fix-prClawSweeper does not recommend queueing a new automated fix PR for this issue.impact:auth-providerAuth, provider routing, model choice, or SecretRef resolution may break.impact:crash-loopCrash, hang, restart loop, or process-level availability failure.impact:message-lossChannel message delivery can be lost, duplicated, or misrouted.issue-rating: 🐚 platinum hermitGood issue quality with a plausible reproduction path needing some confirmation.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions