What task are you trying to do?
We want PawWork to recognize when an agent has moved from a legitimate retry or fallback into low-yield repeated probing of the same investigation target, then summarize the blocker and ask the user instead of continuing to spin.
What do you do today?
The current loop diagnostics added in PR #204 correctly detect repeated identical tool inputs and repeated identical tool error classes, and inject a one-time reminder after the third repeat. That worked at the start of a recent session: three webfetch calls hit the same GitHub Pages 404, and the error_repeat reminder was injected.
However, the model then escaped the current detector by changing tool family and slightly changing the command each time. In the same session, it switched into a long run of read-only bash probes against the same target, for example repeated curl plus different grep variants against https://datawhalechina.github.io/. These calls were technically successful and the command strings differed, so they no longer matched the current repeated-input or repeated-error heuristics, even though user-visible progress had effectively stalled.
What would a good result look like?
PawWork tracks investigation progress at the target level, not only at the exact command or exact error level. When the agent keeps probing the same page, domain, repo, or resource family with low information gain across multiple turns or tool families, the harness should treat that as suspected stuck behavior.
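To make "target level" concrete, here is a minimal sketch of how calls from different tool families could be reduced to a shared target key. It is an illustration only, not PawWork's actual code: the function name `target_keys`, the input field names `url` and `command`, and the choice of the URL host as the grouping key are all assumptions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical helper: matches http(s) URLs embedded in free text such as a shell command.
URL_RE = re.compile(r"https?://[^\s'\"<>|)]+")


def target_keys(tool_name: str, tool_input: dict) -> set:
    """Return coarse target keys (hosts) referenced by one tool call.

    Assumes webfetch inputs carry a `url` field and bash inputs carry a
    `command` field; both field names are illustrative guesses.
    """
    if tool_name == "webfetch":
        text = str(tool_input.get("url", ""))
    elif tool_name == "bash":
        text = str(tool_input.get("command", ""))
    else:
        # Fallback: scan every input value for URLs.
        text = " ".join(str(value) for value in tool_input.values())

    keys = set()
    for url in URL_RE.findall(text):
        host = urlsplit(url).hostname  # already lowercased by urllib
        if host:
            keys.add(host)
    return keys
```

Under these assumptions, a webfetch of https://datawhalechina.github.io/ and a bash `curl ... | grep ...` against the same URL both map to the key `datawhalechina.github.io`, while a pivot to https://www.datawhale.cn yields a new key. A real implementation would probably also need keys for repos and local paths.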
A good result should:
- distinguish legitimate exploration from low-yield probing on the same target
- survive tool-family changes, such as webfetch to bash, when the underlying target is unchanged
- persist the stuck suspicion beyond a single one-shot reminder when behavior does not improve
- trigger a summarize-and-ask path when high-confidence stuck behavior is detected
- avoid forcing early interruption when the agent is still discovering genuinely new targets or evidence
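The list above could be approximated with a small per-target state machine. The sketch below is one possible shape under stated assumptions: the names `StuckTracker`, `StuckLevel`, and `observe` are hypothetical, the reminder threshold of 3 mirrors the "third repeat" behavior from PR #204, and the escalation threshold of 6 is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class StuckLevel(Enum):
    OK = "ok"
    SUSPECTED = "suspected"                  # keep reminding while this holds
    SUMMARIZE_AND_ASK = "summarize_and_ask"  # stop probing and ask the user


@dataclass
class TargetState:
    low_yield_streak: int = 0
    tool_families: set = field(default_factory=set)


@dataclass
class StuckTracker:
    remind_after: int = 3    # assumption: same count as the existing reminder
    escalate_after: int = 6  # assumption: placeholder value
    targets: dict = field(default_factory=dict)

    def observe(self, target: str, tool_family: str, new_evidence: bool) -> StuckLevel:
        """Record one tool call against a target and return the current level."""
        state = self.targets.setdefault(target, TargetState())
        state.tool_families.add(tool_family)
        # The streak is keyed by target, so switching tool family does not reset it.
        if new_evidence:
            state.low_yield_streak = 0
        else:
            state.low_yield_streak += 1

        if state.low_yield_streak >= self.escalate_after:
            return StuckLevel.SUMMARIZE_AND_ASK
        if state.low_yield_streak >= self.remind_after:
            return StuckLevel.SUSPECTED
        return StuckLevel.OK
```

Because the state persists per target, the suspicion is not a one-shot event: it stays at `SUSPECTED` until the agent produces new evidence, and it escalates if the low-yield streak keeps growing. Calls against a genuinely new target start from a fresh state and are not interrupted.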
Which audience does this matter to most?
Both
Extra context
A recent session showed the current boundary clearly: PR #204 did fire once on repeated webfetch 404s, so the problem is not that diagnostics were absent. The gap is that the model then switched to many slightly different bash reads on the same URL and continued for a long time until provider quota stopped the run.
This issue should stay practical and product-facing. It is not a request for a complex agent scheduler, broad model-specific patching, or a full harness rewrite. The smallest useful direction is likely some combination of:
- target-level investigation grouping
- low-information-progress signals for read-only probing
- a persistent suspected-stuck state instead of a one-time reminder only
- a stronger summarize-and-ask escalation path once the model has already ignored the earlier warning
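As one illustration of a low-information-progress signal for read-only probing, the sketch below scores each tool output by the fraction of its lines that have not been seen before for the same target. The function name `novelty_ratio` and the line-level comparison are assumptions; a real signal might need fuzzier matching or a threshold tuned on actual sessions.

```python
def novelty_ratio(output: str, seen_lines: set) -> float:
    """Return the fraction of distinct non-empty lines in `output` that are new.

    `seen_lines` is the per-target memory of lines already observed; it is
    updated in place so later calls are compared against everything seen so far.
    """
    lines = {line.strip() for line in output.splitlines() if line.strip()}
    if not lines:
        return 0.0  # empty output adds no evidence
    fresh = lines - seen_lines
    seen_lines |= fresh
    return len(fresh) / len(lines)
```

With this kind of signal, a grep variant that only returns a subset of an already-fetched 404 page scores 0.0, while a probe that surfaces a previously unseen header or link scores above zero and could count as new evidence.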
Positive and negative examples
Positive example, should NOT be treated as stuck:
A GLM-5.1 session asked how https://datawhalechina.github.io/ was implemented. It saw two direct webfetch 404s on the GitHub Pages URL, then quickly pivoted to genuinely new targets and evidence: the GitHub organization page, the candidate Pages repo, the site headers, and finally https://www.datawhale.cn. After that it summarized the finding that the GitHub Pages URL was no longer the real site, explained the likely implementation options, asked one clarifying question, and finished normally. This is a good example of legitimate fallback and target expansion after an initial 404.
Negative example, SHOULD be treated as suspected stuck:
A Kimi K2.6 session on the same user task also hit repeated webfetch 404s, which correctly triggered the current PR #204 reminder. But after that reminder it escaped the detector by switching tool family and issuing a large number of read-only bash probes against the same underlying target, such as repeated curl plus different grep variants over the GitHub Pages 404 HTML. The exact command strings changed and many calls technically succeeded, but user-visible progress did not. This is the failure mode we want to catch.
Acceptance criteria
- The harness can recognize repeated low-yield probing on the same investigation target even when exact commands differ.
- Detection can span more than one tool family when the underlying target is the same.
- The design distinguishes low-yield probing from real progress that introduces new targets or meaningful new evidence.
- The user sees a summarize-and-ask outcome instead of a long silent spin once high-confidence stuck behavior is reached.
- The implementation remains lightweight, local-first, and explainable during session debugging.