Skip to content

[Bug] First-turn tool execution aborted without recorded source — model stops, user must resend #721

@Astro-Han

Description

@Astro-Han

What happened

The model emits a tool call (e.g., lark-doc skill, lark-cli docs +fetch bash command), tool execution starts, and within ~1 second the entire turn is interrupted. The tool part shows Tool execution aborted, the assistant message receives MessageAbortedError, and the UI shows the 已中断 divider. The model stops producing output and the user has to resend the same message.

This is the "first-turn interrupted" incident that PR #710 was opened to diagnose, but not fix.

Which area seems affected?

Model harness, prompts, tools, or session mechanics

How much does this affect you?

Breaks an important workflow (every interrupted turn loses partial model reasoning and partial tool input; the user must manually resend)

Steps to reproduce

  1. Open a session that has been idle for a while (fresh session, or after the previous turn finished and the user paused for ~10 minutes).
  2. Send a message that the model will respond to with an early tool call. Reproduced in the analyzed log with prompts that include a Feishu doc URL plus an image attachment, prompting lark-doc skill load or lark-cli docs +fetch.
  3. Observe: the model emits reasoning, then a tool call. ~1 second into tool execution, the turn is aborted with Tool execution aborted on the tool part and MessageAbortedError on the assistant message.
  4. Resending the same message usually succeeds.

What did you expect to happen?

The tool call completes normally and the model continues the turn. If a real cancel is needed, it should carry an explicit source so the cause can be identified.

PawWork version

Reproduced on a build prior to v2026.5.18 (which includes PR #710's diagnostic improvements). Need fresh logs from v2026.5.18+ to confirm which interrupt path is firing now.

OS version

Reproduced on macOS in the analyzed log. Platform coverage not yet confirmed.

Can you reproduce it again?

Sometimes (triggered on first turn after idle; both occurrences in the analyzed session match this pattern, on two different providers)

Diagnostics

Both aborts in the analyzed session (pawwork-session-witty-rocket-2026-05-18-03-40-31) share an identical signature:

  • Recorded abort diagnostic carries only propagation_point, error_name, error_message, recorded_at.
  • source, reason, mode, via_ctx_abort are all undefined — i.e., the interrupt did not reach onInterrupt through any explicit SessionPrompt.cancel / scope-finalizer path; the runner's interruptMeta Ref was still undefined at the point of resolution.
  • LLM stream trace shows stream_error: true and aborted: true together, with tool_input_start, tool_input_end, and tool_call all firing once, but tool_result and finish_step never firing.
  • Both occurrences happen on the first turn after an idle window. Reproduced across providers (deepseek-v4-flash, qwen3.6-plus), so this is not provider-specific.

Pre-PR-#710 code (the log was captured before that PR shipped) had three gaps that allowed source loss:

  • Runner.cancelWith(undefined) produced a snapshot of only {recordedAt} with no source/reason.
  • SessionRunState scope finalizer called runner.cancel (the zero-arg form) with no provenance.
  • resolveInterrupt returned whatever was in the Ref without filling a default source.

Context: what we already shipped

PR #710 (fix: preserve abort interrupt provenance, merged 2026-05-18) added provenance labels to the runner cancel-without-meta and scope-close paths, and let renderer session.action.abort diagnostics through the desktop sanitizer. That commit body explicitly notes:

This is diagnostic-only. It does not change when runs are cancelled.
Full prompt_async lifecycle tracing and broader scope ownership redesign remain deferred follow-up work.

This issue tracks the actual behavior fix.

Next step to unblock the fix

Collect one more session log after a user reproduces this on a build that includes PR #710 (v2026.5.18+). The new abort diagnostic will name which path fired:

  • session.run_state.scope — scope-driven fiber interrupt without an explicit cancel call.
  • session.run_state.finalizer — scope finalizer ran (instance/runtime shutting down during the turn).
  • runner.cancel_without_meta / runner.interrupt_without_meta — direct runner-level call/interrupt with no caller-supplied source.
  • session.prompt.cancel — explicit user cancel (would point to a different bug, e.g. unintended auto-cancel from the renderer).

Once the path is known, fix candidates:

  1. If session.run_state.scope / session.run_state.finalizer: find what is closing the SessionRunState scope during an active turn (likely instance lifecycle, directory switch, or runtime restart). Either stop closing it, or drain in-flight turns first.
  2. If runner.cancel_without_meta: find the caller passing no meta, give it provenance, then decide whether the cancel is intentional.

Workaround for users

Resend the same message. In the analyzed log, the same prompt succeeded on the resend.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High prioritybugSomething isn't workingharnessModel harness, prompts, tool descriptions, and session mechanics

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions