[Bug]: GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions (2026.5.27)

## Bug type

Crash (process/app hangs)

## Beta release blocker

No

## Summary

In v2026.5.27, isolated `agentTurn` cron runs using `github-copilot/claude-sonnet-4.6` (and likely any github-copilot model on the `anthropic-messages` transport) silently hang for ~365s with **zero progress events**, then abort internally with `promptError: "This operation was aborted"`. The cron run is then **misclassified as `status: ok`** in the cron history (despite the trajectory's `finalStatus: error`), suppressing all failure alerts — the cron appears healthy while delivering nothing.

This is the github-copilot variant of the watchdog-hang pattern fixed for claude-cli in #86895 / PR #87546. The fix in #87546 only added stream-json progress tracking for the claude-cli backend; the `anthropic-messages` transport used by github-copilot's claude models has no equivalent progress signal, so the gateway watchdog cannot distinguish a working-but-silent stream from a wedged one.

## Steps to reproduce

1. Configure an isolated `agentTurn` cron with `model: "github-copilot/claude-sonnet-4.6"`, no fallbacks, `timeoutSeconds: 600`, simple bash-then-message-tool prompt.
2. Trigger the cron (manual or scheduled).
3. Observe: trajectory file freezes at `prompt.submitted` for ~365s, then `model.completed` fires with `aborted: true`. No tool calls are made. Discord delivery never happens.
4. Cron history records `status: ok`, `delivered: false`, `deliveryStatus: not-requested`.

Reproduces deterministically across all my morning runs (5/5 attempts at 5:00, 5:15, 5:30, 5:45, 6:00 AM PT today, plus an additional manual run at 6:28 AM PT, all hung identically with durations 366,400–372,640 ms).

## Expected behavior

Either:
- The github-copilot anthropic-messages transport emits stream/heartbeat progress so the gateway's stuck-session watchdog can distinguish active streams from hung ones, OR
- A short transport-level idle timeout (e.g., 30–60s with no first token) triggers a transport error that the failover chain can catch, OR
- The cron runner correctly records `status: error` when the trajectory's `finalStatus` is `error`, so existing failure-alert plumbing fires.

## Actual behavior

Run hangs for ~6 min with no progress; aborts internally; cron records "success"; nothing delivered; no alerts.

### Trajectory artifact (`trace.artifacts.data`)

```json
{
  "finalStatus": "error",
  "aborted": true,
  "externalAbort": false,
  "timedOut": false,
  "idleTimedOut": false,
  "timedOutDuringCompaction": false,
  "timedOutDuringToolExecution": false,
  "promptError": "This operation was aborted",
  "promptErrorSource": "prompt",
  "compactionCount": 0,
  "assistantTexts": [],
  "itemLifecycle": { "startedCount": 0, "completedCount": 0, "activeCount": 0 },
  "toolMetas": [],
  "didSendViaMessagingTool": false,
  "messagingToolSentTexts": [],
  "messagingToolSentTargets": []
}
```

### Cron run history (same run, recorded by cron)

```json
{
  "action": "finished",
  "status": "ok",
  "runAtMs": 1779973200025,
  "durationMs": 372636,
  "provider": "github-copilot",
  "model": "claude-sonnet-4.6",
  "deliveryStatus": "not-requested",
  "delivery": { "delivered": false, "fallbackUsed": false }
}
```

Note the contradiction: `status: ok` and `delivered: false` on the cron-side, vs `finalStatus: error` and `aborted: true` on the trajectory-side.

## Environment

- OpenClaw: **2026.5.27**
- Gateway runtime: Node v25.9.0 invoking NVM-installed openclaw at `~/.nvm/versions/node/v26.2.0/lib/node_modules/openclaw/dist/index.js`
- OS: macOS 26.5 (Darwin 25.5.0 arm64)
- Provider: `github-copilot` (token exchanged via `proxy.business.githubcopilot.com`, enterprise SKU `copilot_for_business_seat_quota`, freshly refreshed mid-failure)
- Model: `github-copilot/claude-sonnet-4.6` via `anthropic-messages` API
- Session target: `isolated`
- Affected cron payload: simple `bash + message(action=send) + message(action=upload-file)`
- Auth is healthy throughout (token refresh succeeded; gh auth status reports valid token with all required scopes)

## Related issues / context

- **#86895** (closed today, 2026-05-28) — identical 365s watchdog hang pattern on claude-cli + webchat. PR #87546 (commit 4c3a0292) added stream-json progress tracking for claude-cli only; the github-copilot anthropic-messages transport was not covered. **This issue is the github-copilot half of that bug.**
- **#87327** (open, filed 2026-05-27) — similar isolated-cron stall but in the earlier `runtime-plugins` phase (78–80s, before model invocation). Distinct stall point, same overall family.
- **#65576** (closed 2026-04-25) — historical: cron runs silently disabled the LLM idle watchdog. Fixed, but with no per-attempt idle timeout for streams that open and then go silent, the same surface re-emerges in a different way.
- **#75332** (closed 2026-05-01) — HTTP 429 from github-copilot misclassified as idle timeout. Different trigger, same misclassification-as-success pattern.

## Mitigation in use

The straightforward fix — adding `["github-copilot/gpt-5.5", "github-copilot/claude-opus-4.6"]` as `payload.fallbacks` — **does not work**. I confirmed this with a controlled retry: cron history shows `fallbackUsed: false`, `delivered: false`, `status: ok` after a fresh 374-second hang. The fallback chain is loaded but never triggers.

The reason is that pi-ai's failover classifier keys on specific patterns (the documented "An unknown error occurred" stream-wrapper text, HTTP 429, transport errors, etc.). The internal `AbortError: "This operation was aborted"` from the gateway's stuck-session watchdog does not match any of those patterns, so failover is silently inert for this exact failure mode.

Working mitigation: change the **primary** to a model on a different transport (`github-copilot/gpt-5.5` via `openai-responses`). That bypasses the broken `anthropic-messages` code path entirely. Configured fallbacks back to sonnet/opus are still inert if gpt-5.5 ever hits the same class of issue.

## Three distinct bugs in this one symptom

1. **Primary bug:** `github-copilot` + `anthropic-messages` transport opens a stream, never emits a chunk, and is aborted by the internal watchdog ~365s later. (Sibling of #86895, fix in #87546 only covered claude-cli.)
2. **Failover-classifier gap:** internal `AbortError` from the stuck-session watchdog is not on pi-ai's failover-worthy error list. Configured fallback chains do not fire for this failure mode. The classifier should probably treat `aborted: true` + `assistantTexts: []` + `toolMetas: []` as a failover-worthy outcome regardless of the specific error string.
3. **Cron-runner status misclassification:** when the underlying trajectory ends with `finalStatus: error` and `aborted: true`, the cron runner still records `status: ok` in the run history. This means `failureAlert` blocks don't fire, `consecutiveErrors` stays at zero, the daily Cron Health Monitor reports the cron as healthy, and the user has no signal that anything is wrong — the cron silently dies for days while reporting green. Suggest the cron runner inspect `trace.artifacts.data.finalStatus` (or equivalently `model.completed.data.aborted` combined with empty `assistantTexts`) to set the correct status, and report `deliveryStatus: failed` when `didSendViaMessagingTool: false` on a payload whose prompt requires delivery.

Bugs 2 and 3 are likely worth separate issues, but I've kept them together here because they compose into the user-facing symptom: a cron that hangs, never delivers, never falls back, and never alerts.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions (2026.5.27) #87692

Bug type

Beta release blocker

Summary

Steps to reproduce

Expected behavior

Actual behavior

Trajectory artifact (`trace.artifacts.data`)

Cron run history (same run, recorded by cron)

Environment

Related issues / context

Mitigation in use

Three distinct bugs in this one symptom

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions (2026.5.27) #87692

Description

Bug type

Beta release blocker

Summary

Steps to reproduce

Expected behavior

Actual behavior

Trajectory artifact (trace.artifacts.data)

Cron run history (same run, recorded by cron)

Environment

Related issues / context

Mitigation in use

Three distinct bugs in this one symptom

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Trajectory artifact (`trace.artifacts.data`)