[Task] Track harness improvement series #195

## Why this series exists

PawWork's runtime is vendored from opencode and will stay that way until scale justifies deeper divergence. Where PawWork diverges permanently, the preferred pattern is not to fork the runtime but to adjust the harness layer: product system prompt, base tool descriptions, session mechanics, and loop or diagnostic observability. These adjustments share one direction: fewer surfaces that assume a developer audience, better defaults for weaker models, and clearer recovery when things go wrong.

## What belongs in this series

A change belongs here when it touches any of:

- `packages/opencode/src/session/prompt/*.txt` or nearby prompt-composition paths, including provider system prompts and bundled product instructions.
- `packages/opencode/src/tool/*.txt` or other model-visible tool-description surfaces.
- Session-level behavior such as plan approval, question routing, subagent wiring, loop detection, or diagnostics.
- Global instruction injection, such as `packages/opencode/src/session/instruction.ts` and the bundled `pawwork.txt` loading path.
- Base-tool exposure decisions, plugin boundaries, and other changes that alter what the model sees or prefers by default.

A change does not belong here when it only affects UI (`ui`), the Electron shell (`desktop`), or CI (`ci`), even if it touches agent behavior indirectly.
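
The path-based criteria above can be approximated mechanically. The following is a minimal sketch, not code from PawWork or opencode: `HARNESS_PATTERNS` and `touchesHarness` are hypothetical names, and the patterns cover only the literal paths named in the list. Criteria such as session-level behavior or base-tool exposure decisions still need human judgment.

```typescript
// Hypothetical helper: not part of the PawWork or opencode codebase.
// The patterns mirror the paths named in the list above; everything else is an assumption.
const HARNESS_PATTERNS: RegExp[] = [
  /^packages\/opencode\/src\/session\/prompt\/[^/]+\.txt$/, // prompt text files
  /^packages\/opencode\/src\/tool\/[^/]+\.txt$/, // model-visible tool descriptions
  /^packages\/opencode\/src\/session\/instruction\.ts$/, // global instruction injection
];

// Returns true when at least one changed path matches a harness pattern.
function touchesHarness(changedPaths: string[]): boolean {
  return changedPaths.some((p) => HARNESS_PATTERNS.some((re) => re.test(p)));
}
```

A check like this could suggest the `harness` label on a pull request, but a negative result does not prove that a change is out of scope.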

## Series themes

### 1. System prompt unification and product instruction architecture

Goal: one PawWork-owned behavior surface across providers and models, with product rules living in the right layer instead of being scattered through model-family branches.

- [x] #130 Remove model-specific behavior prompts
- [ ] Follow-up prompt-layer cleanups that keep unfamiliar-tool guidance and product behavior in the system or project instruction layer rather than duplicating workflow text inside generic tools.

### 2. Tool semantics and prompt optimization

Goal: make base tools easier for weaker models to use correctly, remove instructions that teach wrong behavior, and tighten the boundaries between nearby tools.

- [x] #128 Rename and rewrite task as subagent
- [x] #129 Rewrite base tool descriptions under the three-question rule
- [x] #188 Improve question tool clarity and proactive use

### 3. Tool surface reduction and boundary simplification

Goal: keep the default tool surface small and strong, and decide which capabilities should stay first-class versus move behind plugins or a narrower default surface.

- [x] #131 Move advanced low-frequency tools to plugins. Closed as not planned for now: the plugin tool-registration path is too broad for the current harness series, and the base-tool surface should be revisited only with fresh evidence from real sessions.
- [ ] Open direction question: should tools such as `grep` and `glob` remain standalone base tools, move behind plugins, or be progressively replaced by stronger Bash guidance and fewer default tools?

### 4. Web access and source safety defaults

Goal: search when the task needs current or external evidence, but treat fetched content as untrusted input and prefer high-quality sources.

- [x] #132 Define default web access and source safety strategy

### 5. Session control, approval, and recovery

Goal: let the model pause appropriately before risky work, ask better questions, and stop visible spinning when progress is low.

- [ ] #127 Replace visible Plan Mode with lightweight plan approval tool
- [x] #133 Add lightweight loop observation and session diagnostics
- [x] #279 Loop gate also covers low-yield successful repeats, not only failures
- [x] #439 Add structured tool failure reasons for agent recovery
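
To illustrate the idea behind #279, here is a minimal sketch of a loop gate that counts low-yield successful repeats as well as failures. The `ToolCallRecord` shape, the `newInfo` flag, the `"continue"` value, and the default threshold of 3 are assumptions made for illustration. They are not taken from the actual implementation.

```typescript
// Hypothetical shapes: names and thresholds are illustrative, not PawWork's actual types.
interface ToolCallRecord {
  tool: string;
  input: string; // serialized arguments
  ok: boolean; // whether the call succeeded
  newInfo: boolean; // whether the result added anything not already seen
}

type LoopAction = "continue" | "stop";

// Counts trailing calls that repeat the most recent call and were low-yield,
// meaning they either failed or succeeded without adding new information.
function loopGate(history: ToolCallRecord[], maxRepeats = 3): LoopAction {
  if (history.length === 0) return "continue";
  const last = history[history.length - 1];
  let repeats = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const call = history[i];
    const same = call.tool === last.tool && call.input === last.input;
    const lowYield = !call.ok || !call.newInfo;
    if (!same || !lowYield) break;
    repeats++;
  }
  return repeats >= maxRepeats ? "stop" : "continue";
}
```

Treating a successful call with no new information the same as a failure is the point of the sketch: a model that keeps re-running the same successful search is spinning just as much as one that keeps hitting the same error.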

### 6. Export and replay observability

Goal: make local session exports useful enough to explain failures without adding remote telemetry or a dashboard.

- [x] #214 Add LLM stream diagnostics to local session export
- [x] #267 Hide synthetic stop tool part in UI when loopAction is `stop`
- [ ] #808 Define Run Incident Framework for interrupted runs
  - [ ] #802 Add lifecycle causality diagnostics for interrupted runs
  - [ ] #803 Refine run diagnostics for interrupted streaming tool calls
  - [ ] #804 Add safe recovery for interrupted streaming runs
  - [ ] #755 30s LLM connect timeout aborts OpenAI reasoning streams

## Current known gaps

These are already implied by the themes above, but are called out here because they are easy to lose between issues.

- **`bash.txt` should not embed GitHub workflow.** A generic terminal tool description should not carry commit, PR, or inline-review workflow tutorials as if every PawWork user were a frequent `gh` user.
- **Unfamiliar-command guidance should live above Bash.** The preferred rule is: when the model is not confident about a CLI surface such as GitHub CLI, first check `gh <command> --help` instead of guessing API shape or arguments. This belongs in the system or project instruction layer, not as a large embedded workflow inside `bash.txt`.
- **Tool consolidation direction is still open, but not currently planned as plugin migration work.** Moving tools behind plugins requires a broader plugin SDK tool-registration surface. Keep this as a design question until real sessions show the current base surface is the bottleneck.
- **High-friction workflow ergonomics may still need dedicated helpers.** If repeated real sessions show that workflows like PR inline review remain error-prone even after prompt cleanup, we may need narrower helpers instead of expecting models to build everything from raw `gh api` usage.
- **Do not turn model intent mistakes into automatic harness repair.** PawWork should make the failure layer clear, as in #439, but avoid guessing paths, commands, filenames, or other semantic intent on the model's behalf.
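
The last point can be sketched as a failure shape that names the failing layer without proposing a repair. This is an illustration only: `FailureLayer`, `ToolFailure`, and `renderFailure` are hypothetical names, and the layer values are assumptions rather than the format introduced in #439.

```typescript
// Hypothetical failure shape: field names and layer values are assumptions,
// not the format from #439.
type FailureLayer = "input" | "permission" | "execution";

interface ToolFailure {
  layer: FailureLayer;
  tool: string;
  message: string;
}

// Renders a model-readable line that states what failed and at which layer.
// It deliberately offers no "did you mean" guess for paths or commands.
function renderFailure(f: ToolFailure): string {
  return `[${f.tool}] ${f.layer} failure: ${f.message}`;
}
```

For example, `renderFailure({ layer: "input", tool: "read", message: "Path does not exist: notes.md" })` returns `[read] input failure: Path does not exist: notes.md`, which leaves the decision about the correct path to the model.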

## Active work

- [ ] #127 Replace visible Plan Mode with lightweight plan approval tool

## Maintenance

This issue stays open as a living index for the harness series. When filing new harness work:

- apply label `harness`
- add a task-list line here
- place the new issue under one of the theme headings above, or add a new theme only if the gap is genuinely new

Task items are checked automatically when their linked sub-issues close, through GitHub's native sub-issue linkage.

## Precedent

Prior closed work worth citing as precedent:

- #130 Remove model-specific behavior prompts, a completed prompt-layer simplification that reduced model-family behavior drift without forking runtime internals.
- #128 Rename and rewrite task as subagent, clarifying subagent semantics without expanding the default tool surface.
- #129 Rewrite base tool descriptions under the three-question rule, tightening base-tool wording and deletion guardrails.
- #132 Define default web access and source safety strategy, making current/external evidence and untrusted fetched content explicit.
- #133 Add lightweight loop observation and session diagnostics, giving repeated failures a local diagnostic path.
- #188 Improve question tool clarity and proactive use, making clarification a bounded recovery path rather than a mode switch.
- #279 Loop gate also covers low-yield successful repeats, extending loop handling beyond hard failures.
- #439 Add structured tool failure reasons for agent recovery, making ordinary tool failures model-readable without automatic semantic repair.
- #788 Add run observability diagnostics, the immediate run-facts foundation for #808.
- #794 Trace lifecycle close provenance, the lifecycle-provenance foundation for #802 and #808.

