What task are you trying to do?
We want PawWork agents to recover from tool failures instead of blindly retrying the same failing action. When a tool fails, the harness should make the failure layer clear enough for the model to choose the right next step: fix bad arguments, ask for permission, switch to a safer tool, stop after user cancellation, or report a provider/environment problem.
Which area would this change affect?
Model harness, prompts, tools, or session mechanics
What do you do today?
Loop diagnostics already record useful structure for repeated tool behavior, but ordinary tool failures often reach the model, and appear in exports, as nothing more than a status plus an error string. That is usually enough for a human inspecting the session later, but it is not reliable enough for the agent to decide whether retrying makes sense or whether it should change strategy.
What would a good result look like?
Tool failures carry a small structured reason that is preserved in local session state and exports, and the model-facing failure result includes a short recovery hint. The first version should stay local and practical: no remote telemetry, no dashboard, no broad analytics platform.
Suggested initial categories:
- invalid_arguments: the model called the tool with malformed or invalid input.
- permission_denied: PawWork or the OS blocked the action.
- environment: the target path, command, working directory, or local setup was not available as expected.
- provider: an external model/API/search/provider failed.
- timeout: the tool exceeded its allowed time.
- user_aborted: the user canceled or interrupted the action.
- unknown: the harness could not classify the failure yet.
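As a rough illustration of the categories above, the structured reason and its hint could be modeled as below. This is only a sketch: `errorKind` and the category names come from this issue, while `ToolFailure`, `RECOVERY_HINTS`, `makeToolFailure`, and the hint wording are hypothetical and not taken from PawWork's code.

```typescript
// Category names are taken from this issue; everything else is illustrative.
const ERROR_KINDS = [
  "invalid_arguments",
  "permission_denied",
  "environment",
  "provider",
  "timeout",
  "user_aborted",
  "unknown",
] as const;

type ErrorKind = (typeof ERROR_KINDS)[number];

// Hypothetical hint wording; the real copy would be decided during implementation.
const RECOVERY_HINTS: Record<ErrorKind, string> = {
  invalid_arguments: "Fix the tool arguments before calling the tool again.",
  permission_denied: "Ask for permission or switch to a safer tool.",
  environment: "Check the path, command, or working directory before retrying.",
  provider: "Report the provider problem; an identical retry may fail the same way.",
  timeout: "Narrow the request or report the timeout instead of repeating it unchanged.",
  user_aborted: "Stop. The user canceled this action.",
  unknown: "Do not retry blindly; report the raw error.",
};

// Hypothetical shape of a model-facing failure result.
interface ToolFailure {
  status: "error";
  error: string; // the existing free-form error string
  errorKind: ErrorKind; // the structured reason proposed here
  recoveryHint: string; // short hint matched to errorKind
}

// Builds a failure result, falling back to "unknown" when no kind is supplied.
function makeToolFailure(error: string, errorKind: ErrorKind = "unknown"): ToolFailure {
  return { status: "error", error, errorKind, recoveryHint: RECOVERY_HINTS[errorKind] };
}
```

Typing the hint table as `Record<ErrorKind, string>` means the compiler flags any category that is added later without a matching hint.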
What would count as done?
- Tool error metadata includes a structured failure reason such as errorKind.
- Session export preserves the structured failure reason without uploading conversation or tool bodies remotely.
- The model-facing tool failure includes a short recovery hint matched to the reason.
- At least these paths are covered: invalid tool arguments, permission denial, user abort, timeout, provider/API failure, and environment failure.
- Unknown failures remain possible and are explicitly tagged as unknown so future issues can improve classification.
- Existing loop diagnostics continue to work and are not replaced by this feature.
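One possible way to cover the paths listed above is a small classifier that maps a thrown error to a category and falls back to unknown. The sketch below rests on assumptions: the error names `ToolArgumentError` and `ProviderError` are hypothetical, treating `AbortError` as a user cancellation is an assumption about how the harness aborts tools, and the `code` values are standard Node.js system error codes rather than anything specific to PawWork.

```typescript
type ErrorKind =
  | "invalid_arguments"
  | "permission_denied"
  | "environment"
  | "provider"
  | "timeout"
  | "user_aborted"
  | "unknown";

// Node.js system error codes mapped to a category (mapping is illustrative).
const KIND_BY_CODE = new Map<string, ErrorKind>([
  ["EACCES", "permission_denied"],
  ["EPERM", "permission_denied"],
  ["ENOENT", "environment"],
  ["ENOTDIR", "environment"],
  ["ETIMEDOUT", "timeout"],
]);

// Error names mapped to a category. The first two names are hypothetical.
const KIND_BY_NAME = new Map<string, ErrorKind>([
  ["ToolArgumentError", "invalid_arguments"],
  ["ProviderError", "provider"],
  ["AbortError", "user_aborted"], // assumes aborts originate from the user
  ["TimeoutError", "timeout"],
]);

// Returns the best-matching category, or "unknown" when nothing matches.
function classifyToolError(err: unknown): ErrorKind {
  if (!(err instanceof Error)) return "unknown";
  const code = (err as Error & { code?: unknown }).code;
  if (typeof code === "string") {
    const byCode = KIND_BY_CODE.get(code);
    if (byCode !== undefined) return byCode;
  }
  return KIND_BY_NAME.get(err.name) ?? "unknown";
}
```

Anything the tables do not recognize is tagged unknown rather than guessed, which keeps unclassified failures visible in exports for later issues to refine.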
What should stay out of scope?
- Remote telemetry by default.
- A metrics dashboard or alerting system.
- A broad dynamic-context rewrite.
- Semantic interpretation of every command output.
- Changing user-facing copy beyond concise recovery hints needed for the model and export diagnostics.
Which audience does this matter to most?
Both
Extra context
This comes from comparing PawWork's current harness against Cursor's public writing about its harness. Cursor treats tool failures as a first-class harness quality signal, but PawWork should take the smaller version first: classify local tool failures so the agent can recover and the export can explain what happened.
Refs #195.