[Feature] Define Run Incident Framework for interrupted runs #808

## Goal

Define a unified Run Incident Framework for assistant runs that end unexpectedly. This is the umbrella direction for #802, #803, and #804: do not fix each observed interruption as an isolated patch; build one shared language for facts, cause, phase, recovery policy, and user-facing presentation.

## Why this exists

Recent terminated-session exports showed that PawWork can collect many useful low-level signals, but the signals are still too ad hoc:

- #802 tracks local lifecycle-close causality: if PawWork locally reloads/disposes an active run, the export should say who initiated it.
- #803 tracks provider/transport interruption classification: if OpenAI streaming disconnects while tool input is still being generated, it must not be labeled as a tool execution failure.
- #804 tracks recovery UX: once the cause and safety facts are known, the session should offer the safest next step instead of feeling dead.

The common problem is not a single missing field: PawWork needs a unified run-incident model.
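To make the #803 case concrete, here is a minimal sketch of classifying an interruption by where it happened rather than by what was pending. The `Phase` and `Cause` names and the `classify_interruption` helper are hypothetical, chosen for illustration only; they are not taken from the PawWork codebase.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names, for illustration only.
    MODEL_STREAMING_TEXT = "model_streaming_text"
    TOOL_INPUT_GENERATING = "tool_input_generating"
    TOOL_EXECUTING = "tool_executing"


class Cause(Enum):
    # Hypothetical cause labels, for illustration only.
    PROVIDER_STREAM_DISCONNECT = "provider_stream_disconnect"
    TOOL_EXECUTION_FAILURE = "tool_execution_failure"
    UNKNOWN = "unknown"


def classify_interruption(
    phase: Phase, stream_disconnected: bool, tool_raised: bool
) -> Cause:
    """Pick a cause from observed facts plus the phase they occurred in."""
    # A tool failure is only claimed when the tool was actually running.
    if tool_raised and phase is Phase.TOOL_EXECUTING:
        return Cause.TOOL_EXECUTION_FAILURE
    # A dropped stream is a provider/transport problem, even if the model
    # was in the middle of generating tool input at the time.
    if stream_disconnected:
        return Cause.PROVIDER_STREAM_DISCONNECT
    return Cause.UNKNOWN
```

Under these assumptions, a disconnect during `TOOL_INPUT_GENERATING` is reported as a provider stream disconnect, because the tool never started executing.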

## Direction

A run incident should be split into five layers:

1. Facts — what actually happened.
2. Cause — why the run ended.
3. Phase — where the run was interrupted.
4. Policy — what recovery action is safe.
5. Presentation — what export reviewers and users should see.
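The five layers above could be carried in one record, sketched below under stated assumptions. The `RunIncident` type, the `build_incident` helper, the `side_effects_started` fact, and the string values are all hypothetical and only illustrate the layering; the actual design is the one recorded in the issue comments.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RunIncident:
    """Hypothetical record with one field per layer."""
    facts: Dict[str, object]  # 1. Facts: raw observations, no interpretation
    cause: str                # 2. Cause: why the run ended
    phase: str                # 3. Phase: where the run was interrupted
    policy: str               # 4. Policy: which recovery action is safe
    presentation: str         # 5. Presentation: what reviewers and users see


def build_incident(facts: Dict[str, object], cause: str, phase: str) -> RunIncident:
    """Derive policy and presentation from the lower layers."""
    # Assumed safety rule: retry automatically only when the facts say no
    # side effects had started; if the fact is missing, assume the worst.
    if facts.get("side_effects_started", True):
        policy = "ask_user"
    else:
        policy = "retry"
    presentation = (
        f"Run interrupted during {phase} ({cause}); "
        f"suggested next step: {policy}."
    )
    return RunIncident(facts, cause, phase, policy, presentation)
```

Deriving policy and presentation from facts, cause, and phase keeps a new diagnostic from adding a one-off field that bypasses the lower layers.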

The first design draft is recorded in the first comment on this issue.

## Related work

Direct implementation issues:
- #802 — local lifecycle causality diagnostics.
- #803 — provider/transport streaming interruption diagnostics.
- #804 — safe recovery and retry/continue UX.

Related active or residual reliability work:
- #755 — OpenAI connect-timeout behavior.

Foundational completed work:
- #788 — run observability diagnostics.
- #794 — lifecycle close provenance.
- #214 — LLM stream diagnostics in local session export.
- #133 — lightweight loop observation and session diagnostics.
- #439 — structured tool failure reasons for agent recovery.

Series index:
- #195 — harness improvement series.

## Scope

In scope:
- Establish the shared model and naming for run incidents.
- Keep #802/#803/#804 aligned under this model.
- Require future interruption diagnostics to add facts/cause/phase/policy/presentation rather than one-off fields.
- Preserve privacy and redaction constraints.

Out of scope:
- One giant implementation PR.
- Provider SDK/network fixes.
- Telemetry sink integration.
- Broad UI redesign.

## Execution mode

Investigate and get the design plan approved first. Here, "plan" means the issue-level design / scope proposal, not a PR-level implementation checklist. Once the approved design exists, agents may proceed with implementation plans inside the agreed scope; post a new issue comment and wait for explicit "approved" only when the implementation would change that design scope.
