What task are you trying to do?
We want PawWork to make stuck agent behavior easier to diagnose and less likely to reach users, without building a complex routing or risk-level system.
What do you do today?
The current loop protection only catches a narrow repeated-tool pattern. When a model gets stuck, we may not have enough structured local evidence to understand whether the issue came from the model, the harness, a tool error, output truncation, or a bad prompt.
What would a good result look like?
PawWork records lightweight local diagnostics for session behavior, such as repeated tool input hashes, tool errors, refusal or block events, step count, elapsed time, output truncation, and approval loops. High-confidence stuck behavior leads the model to summarize what happened and ask the user before continuing. Long-running tasks are allowed to continue when they show real progress.
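As a rough illustration of the kind of record described above, here is a minimal Python sketch. It is not PawWork's actual harness code: `SessionDiagnostics`, `hash_tool_input`, and every field name are hypothetical, chosen only to mirror the signals listed in this answer. The sketch stores a hash of each tool input rather than the input body, so repeats can be counted without keeping conversation or tool content.

```python
import hashlib
import json
import time
from collections import Counter
from dataclasses import dataclass, field


def hash_tool_input(tool_name: str, tool_input: dict) -> str:
    """Hash a tool call so repeats can be counted without storing the body."""
    # Sorting keys makes the hash independent of argument order.
    canonical = json.dumps(tool_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tool_name}:{canonical}".encode("utf-8")).hexdigest()


@dataclass
class SessionDiagnostics:
    """Hypothetical in-memory diagnostic record for one session."""

    started_at: float = field(default_factory=time.monotonic)
    step_count: int = 0
    tool_errors: int = 0
    refusals_or_blocks: int = 0
    truncated_outputs: int = 0
    approval_requests: int = 0
    input_hash_counts: Counter = field(default_factory=Counter)

    def record_tool_call(self, tool_name: str, tool_input: dict, *,
                         error: bool = False, truncated: bool = False) -> None:
        # Only counters and hashes are kept; the tool body is discarded.
        self.step_count += 1
        self.input_hash_counts[hash_tool_input(tool_name, tool_input)] += 1
        if error:
            self.tool_errors += 1
        if truncated:
            self.truncated_outputs += 1

    def max_repeats(self) -> int:
        """Highest number of times any identical tool call was seen."""
        return max(self.input_hash_counts.values(), default=0)

    def elapsed_seconds(self) -> float:
        return time.monotonic() - self.started_at
```

A harness could call `record_tool_call` after each tool step and read `max_repeats()` alongside the error and truncation counters when deciding whether a session looks stuck.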
Which audience does this matter to most?
Both
Extra context
This issue should stay observational and practical. It is not a complex agent scheduler, not a five-level risk system, and not remote telemetry by default.
Acceptance criteria
- The harness records lightweight local diagnostic metadata useful for debugging stuck sessions.
- By default, diagnostic records do not upload conversation or tool body content to any remote service.
- The design distinguishes true loops from long tasks that keep making progress.
- High-confidence stuck behavior triggers a summarize-and-ask path instead of silent spinning.
- The existing Trash protection remains the main destructive-risk mitigation.
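
One possible way to separate true loops from long tasks that keep making progress, sketched in Python under stated assumptions. `StuckDetector`, `Step`, the window size, and the repeat threshold are all hypothetical and not taken from the existing loop protection. The sketch treats a step as progress when its (input hash, output hash) pair is new within a recent window, and takes the summarize-and-ask path only when the same pair keeps recurring.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One tool step, reduced to hashes so no body content is kept."""
    input_hash: str
    output_hash: str


class StuckDetector:
    """Hypothetical sliding-window check for repeated, progress-free steps."""

    def __init__(self, window: int = 6, repeat_threshold: int = 3):
        self.repeat_threshold = repeat_threshold
        self.recent = deque(maxlen=window)  # oldest steps fall off automatically

    def observe(self, step: Step) -> str:
        self.recent.append(step)
        if len(self.recent) < self.repeat_threshold:
            return "continue"
        # Same input AND same output means the call taught the model nothing new.
        pair_counts = Counter((s.input_hash, s.output_hash) for s in self.recent)
        if max(pair_counts.values()) >= self.repeat_threshold:
            return "summarize_and_ask"
        return "continue"


# A long task with distinct steps is never flagged.
long_task = StuckDetector()
assert all(long_task.observe(Step(f"in{i}", f"out{i}")) == "continue" for i in range(20))

# An identical call with an identical result is flagged on the third repeat.
loop = StuckDetector()
assert [loop.observe(Step("same", "same")) for _ in range(3)][-1] == "summarize_and_ask"
```

Polling the same input while the output keeps changing (for example, a build status check) is treated as progress in this sketch. A real implementation would likely combine this with the other recorded signals, such as tool errors and approval loops, before deciding.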