[Task] Track harness improvement series #195

@Astro-Han

Why this series exists

PawWork's runtime is vendored from opencode and will stay that way until scale justifies deeper divergence. Where PawWork diverges permanently, the preferred pattern is not to fork the runtime but to adjust the harness layer: product system prompt, base tool descriptions, session mechanics, and loop or diagnostic observability. These adjustments share one direction: fewer surfaces that assume a developer audience, better defaults for weaker models, and clearer recovery when things go wrong.

What belongs in this series

A change belongs here when it touches any of:

  • packages/opencode/src/session/prompt/*.txt or nearby prompt-composition paths, including provider system prompts and bundled product instructions.
  • packages/opencode/src/tool/*.txt or other model-visible tool-description surfaces.
  • Session-level behavior such as plan approval, question routing, subagent wiring, loop detection, or diagnostics.
  • Global instruction injection, such as packages/opencode/src/session/instruction.ts and the bundled pawwork.txt loading path.
  • Base-tool exposure decisions, plugin boundaries, and other changes that alter what the model sees or prefers by default.

A change does not belong here when it only affects UI (ui), the Electron shell (desktop), or CI (ci), even if it touches agent behavior indirectly.
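The path-based part of this rule can be sketched as a small check. This is a hypothetical helper, not something that exists in PawWork or opencode: `touchesHarnessSurface` and `HARNESS_PATH_PATTERNS` are illustrative names, and the patterns encode only the explicit paths listed above. The behavior-level criteria (session mechanics, tool-exposure decisions, "nearby" prompt-composition paths) still need human judgment.

```typescript
// Hypothetical helper: illustrates the path-based scope rule only.
// The patterns cover just the explicit paths named in the list above.
const HARNESS_PATH_PATTERNS: RegExp[] = [
  /^packages\/opencode\/src\/session\/prompt\/[^/]+\.txt$/,
  /^packages\/opencode\/src\/tool\/[^/]+\.txt$/,
  /^packages\/opencode\/src\/session\/instruction\.ts$/,
];

// Returns true when at least one changed path matches a known harness surface.
function touchesHarnessSurface(changedPaths: string[]): boolean {
  return changedPaths.some((path) =>
    HARNESS_PATH_PATTERNS.some((pattern) => pattern.test(path)),
  );
}
```

A `false` result does not prove a change is out of scope. It only means none of the explicitly named files were touched.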

Series themes

1. System prompt unification and product instruction architecture

Goal: one PawWork-owned behavior surface across providers and models, with product rules living in the right layer instead of being scattered through model-family branches.

  • [Feature] Remove model-specific behavior prompts #130
  • Follow-up prompt-layer cleanups that keep unfamiliar-tool guidance and product behavior in the system or project instruction layer rather than duplicating workflow text inside generic tools.

2. Tool semantics and prompt optimization

Goal: make base tools easier for weaker models to use correctly, remove instructions that teach wrong behavior, and tighten the boundaries between nearby tools.

3. Tool surface reduction and boundary simplification

Goal: keep the default tool surface small and strong, and decide which capabilities should stay first-class versus move behind plugins or a narrower default surface.

  • [Feature] Move advanced low-frequency tools to plugins #131 (closed as not planned for now). The plugin-tool-registration path is too broad for the current harness series, and the base-tool surface should be revisited only with fresh evidence from real sessions.
  • Open direction question: should tools such as grep and glob remain standalone base tools, move behind plugins, or be progressively replaced by stronger Bash guidance and fewer default tools?

4. Web access and source safety defaults

Goal: search when the task needs current or external evidence, but treat fetched content as untrusted input and prefer high-quality sources.

5. Session control, approval, and recovery

Goal: let the model pause appropriately before risky work, ask better questions, and stop visible spinning when progress is low.
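As a rough illustration of "visible spinning", one possible signal is the model repeating the same tool call with the same arguments. The sketch below is an assumption, not PawWork's actual loop detection: `ToolCall`, `isSpinning`, and the default threshold of 3 are all hypothetical.

```typescript
// Hypothetical shapes; PawWork's real loop detection may look different.
interface ToolCall {
  tool: string;
  args: unknown;
}

// Returns true when the last `threshold` calls are identical
// (same tool name and same JSON-serialized arguments).
function isSpinning(history: ToolCall[], threshold = 3): boolean {
  if (threshold < 2 || history.length < threshold) return false;
  const recent = history
    .slice(-threshold)
    .map((call) => `${call.tool}:${JSON.stringify(call.args)}`);
  return recent.every((signature) => signature === recent[0]);
}
```

Comparing serialized arguments is sensitive to object key order, so this only catches exact repeats. Near-duplicate calls would need a looser comparison.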

6. Export and replay observability

Goal: make local session exports useful enough to explain failures without adding remote telemetry or a dashboard.

Current known gaps

These are already implied by the themes above, but are called out here because they are easy to lose between issues.

  • bash.txt should not embed GitHub workflow. A generic terminal tool description should not carry commit, PR, or inline-review workflow tutorials as if every PawWork user were a frequent gh user.
  • Unfamiliar-command guidance should live above Bash. The preferred rule is: when the model is not confident about a CLI surface such as GitHub CLI, first check gh <command> --help instead of guessing API shape or arguments. This belongs in the system or project instruction layer, not as a large embedded workflow inside bash.txt.
  • Tool consolidation direction is still open, but not currently planned as plugin migration work. Moving tools behind plugins requires a broader plugin SDK tool-registration surface. Keep this as a design question until real sessions show the current base surface is the bottleneck.
  • High-friction workflow ergonomics may still need dedicated helpers. If repeated real sessions show that workflows like PR inline review remain error-prone even after prompt cleanup, we may need narrower helpers instead of expecting models to build everything from raw gh api usage.
  • Do not turn model intent mistakes into automatic harness repair. PawWork should make the failure layer clear, as in [Feature] Add structured tool failure reasons for agent recovery #439, but avoid guessing paths, commands, filenames, or other semantic intent on the model's behalf.
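The last point, reporting the failure layer without repairing intent, can be sketched as a result type. Everything here is hypothetical: issue #439 may define a different shape, and `FailureLayer`, `ToolFailure`, and `describeMissingFile` are illustrative names only.

```typescript
// Hypothetical types; the structure proposed in #439 may differ.
type FailureLayer = "model_input" | "environment" | "harness";

interface ToolFailure {
  layer: FailureLayer; // which layer the failure belongs to
  reason: string;      // what went wrong, stated plainly
  // Deliberately no "suggestedPath" or "correctedArgs" field:
  // the harness reports the failure, the model decides what to do next.
}

// Reports a missing file as a model-input failure without guessing
// a nearby filename on the model's behalf.
function describeMissingFile(requestedPath: string): ToolFailure {
  return {
    layer: "model_input",
    reason: `File not found: ${requestedPath}`,
  };
}
```

Leaving correction fields out of the type keeps the boundary explicit: the harness names the layer and the reason, and recovery stays with the model.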

Active work

Maintenance

This issue stays open as a living index for the harness series. When filing new harness work:

  • apply label harness
  • add a task-list line here
  • place the new issue under one of the theme headings above, or add a new theme only if the gap is genuinely new

Task items are checked off automatically when sub-issues close, via GitHub's native sub-issue linkage.

Precedent

Prior closed work worth citing as precedent:

Metadata

Assignees

No one assigned

    Labels

    P2 (Medium priority); harness (Model harness, prompts, tool descriptions, and session mechanics); task (Narrow execution, audit, spike, migration, tracking, or upstream follow-up work)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests