fix(agent): feed prettified zod issues + sent params back to LLM on retry #34
Merged
…etry
When `_validateAndNormalizeActions` rejected an action's params via
`actionInfo.paramSchema.safeParse(rawParams)`, the thrown error message
was the raw `paramsResult.error.message` — i.e. zod v4's default JSON
dump of the `issues` array (`[{"expected":"number","code":"invalid_type",
"path":["num_pages"],"message":"Invalid input: expected number, received
boolean"}]`). This noisy blob did flow into `state.last_result` and into
the next `create_state_messages` turn, but it was hard for the model to
parse and gave no corrective hint, so the model retried with the same
mistake until `max_failures=3` tripped.
Use `z.prettifyError(paramsResult.error)` (zod v4 native) to render
issues as readable lines (e.g. `✖ Invalid input: expected number,
received boolean → at num_pages`), include the offending params verbatim
so the model can diff against the schema, and tag the message with an
explicit `Schema validation failed` prefix plus a `Please retry with
parameters matching the action's schema exactly` instruction.
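
The message shape described above can be sketched without any dependency. This is a minimal illustration, not the code from the PR: `Issue`, `prettifyIssues`, and `buildValidationFeedback` are hypothetical names, the `scroll` action name is made up for the example, and `prettifyIssues` only imitates the layout quoted above (in the real change that job is done by `z.prettifyError(paramsResult.error)`).

```typescript
// Hypothetical, dependency-free stand-in for the shape of a zod issue.
interface Issue {
  path: (string | number)[];
  message: string;
}

// Imitates the "✖ message → at path" layout quoted in the commit message.
// Assumption: the real code calls z.prettifyError instead of this helper.
function prettifyIssues(issues: Issue[]): string {
  return issues
    .map((issue) => {
      const where = issue.path.length > 0 ? `\n  → at ${issue.path.join(".")}` : "";
      return `✖ ${issue.message}${where}`;
    })
    .join("\n");
}

// Hypothetical helper: assembles the text that is thrown and later fed back
// to the LLM (prefix + readable issues + echoed params + corrective hint).
function buildValidationFeedback(
  actionName: string,
  rawParams: unknown,
  issues: Issue[],
): string {
  return [
    `Schema validation failed for action "${actionName}":`,
    prettifyIssues(issues),
    `Sent params: ${JSON.stringify(rawParams)}`,
    "Please retry with parameters matching the action's schema exactly.",
  ].join("\n");
}

const feedback = buildValidationFeedback(
  "scroll", // made-up action name for the example
  { num_pages: true },
  [{ path: ["num_pages"], message: "Invalid input: expected number, received boolean" }],
);
console.log(feedback);
```

Echoing the params verbatim is what lets the model compare what it sent against the failing path, rather than guessing which field the issue refers to.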
The existing pipeline does the rest: thrown Error → `_handle_step_error`
→ `state.last_result = [ActionResult({error: ...})]` → next step's
`create_state_messages` injects it into the LLM context. No new
injection mechanism, just a better-shaped payload going through the
existing one.
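
A rough sketch of that existing channel, under heavy simplification: the types and the two functions below are hypothetical stand-ins for `_handle_step_error` and `create_state_messages`, and the failure counter is an assumption about how `max_failures` is tracked. The point is only that the error text travels through state, so improving the text is enough.

```typescript
// Simplified, hypothetical shapes; the real ActionResult and agent state
// carry many more fields.
interface ActionResult {
  error?: string;
}

interface AgentState {
  last_result: ActionResult[];
  consecutive_failures: number;
}

// Stand-in for _handle_step_error: record the thrown message on the state.
function handleStepError(state: AgentState, err: Error): void {
  state.consecutive_failures += 1;
  state.last_result = [{ error: err.message }];
}

// Stand-in for create_state_messages: surface last step's errors to the LLM.
function createStateMessages(state: AgentState): string[] {
  const messages: string[] = [];
  for (const result of state.last_result) {
    if (result.error !== undefined) {
      messages.push(`Action error: ${result.error}`);
    }
  }
  return messages;
}

const state: AgentState = { last_result: [], consecutive_failures: 0 };
handleStepError(state, new Error("Schema validation failed: ✖ expected number → at num_pages"));
const nextTurn = createStateMessages(state);
console.log(nextTurn);
```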
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
unadlib
approved these changes
May 7, 2026
caffeinum
added a commit
to caffeinum/browser-use
that referenced
this pull request
May 7, 2026
…d throw on exhaust

Replaces the prior lenientBool default-false approach, which silently masked bu-2-0's tendency to emit undefined for `is_correct` and `verdict` boolean fields. Defaulting to false hid the model bug from operators and left the orchestrator unable to distinguish "model emitted false" from "model emitted nothing".

Now: when SimpleJudgeSchema or JudgeSchema fails parse, we send the prettified zod errors back to the LLM and retry up to 2 times, matching PR webllm#34's pattern for action-emission retries. If retries exhaust, surface the failure on the run's final ActionResult so harbor's failure_reason picks it up: `_run_simple_judge` marks the run as failed with a `[Judge schema invalid: ...]` note; `_judge_trace` synthesizes a verdict=false judgement with the schema error in `failure_reason`.

Adds JudgeSchemaInvalidError (src/exceptions.ts) for the internal throw. A first-attempt shape check (any judge-related key present) preserves the prior graceful-skip path when the LLM returns a non-judge JSON shape entirely (e.g. an agent-step JSON in mocked tests), so we don't regress component tests that wire one mock LLM for both agent and judge calls.

This pairs strict zod with feedback-driven self-correction (per reference_zod_pydantic_parity.md) instead of papering over the model bug with a default.

Adds test/agent-judge-schema-retry.test.ts covering:
(1) bu-2-0 missing-is_correct retry-then-recover,
(2) retries exhaust → run marked failed,
(3) network errors stay swallowed,
(4) JudgeSchema verdict-missing exhaustion path,
(5) verdict self-correction on retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
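
The retry-then-throw behaviour described in this commit could look roughly like the following. It is a synchronous, dependency-free sketch under stated assumptions: `parseSimpleJudge`, `judgeWithRetry`, and the `ask` callback are hypothetical names, the hand-written check stands in for the zod schema, and only the `JudgeSchemaInvalidError` name and the retry count of 2 come from the commit message.

```typescript
// The class name comes from the commit message; this body is an assumption.
class JudgeSchemaInvalidError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "JudgeSchemaInvalidError";
  }
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Strict check: a missing boolean is an error, never a silent `false`.
function parseSimpleJudge(raw: Record<string, unknown>): ParseResult<{ is_correct: boolean }> {
  const v = raw["is_correct"];
  if (typeof v !== "boolean") {
    return {
      ok: false,
      error: `✖ Invalid input: expected boolean, received ${typeof v}\n  → at is_correct`,
    };
  }
  return { ok: true, value: { is_correct: v } };
}

// `ask` stands in for the LLM call; `feedback` is the previous attempt's
// validation error, or null on the first attempt.
function judgeWithRetry(
  ask: (feedback: string | null) => Record<string, unknown>,
  maxRetries = 2,
): { is_correct: boolean } {
  let feedback: string | null = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = parseSimpleJudge(ask(feedback));
    if (result.ok) return result.value;
    feedback = result.error; // fed back to the model on the next attempt
  }
  throw new JudgeSchemaInvalidError(`Judge schema invalid: ${feedback}`);
}

// Usage: a fake model that omits the field once, then corrects itself.
let calls = 0;
const verdict = judgeWithRetry((fb) => {
  calls += 1;
  return fb === null ? {} : { is_correct: true };
});
console.log(verdict, calls);
```

Throwing on exhaustion, instead of defaulting to `false`, keeps "model emitted false" distinguishable from "model emitted nothing" for whatever consumes the verdict.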
browser-use v0.6.1 has been released.
caffeinum
added a commit
to caffeinum/browser-use
that referenced
this pull request
May 19, 2026
…arity)
Empirical observation: bu-2-0 (browser-use cloud LLM) occasionally emits
`{input_text: {index: true, text: "..."}}` (boolean) where the schema expects
an integer. Python upstream silently coerces `True -> 1` / `False -> 0` via
pydantic's default lax mode. zod (TS) hard-rejects with
`expected number, received boolean`, the agent retries with the same broken
output, and bails at max_failures. This was the observed bail mode in production
for every auth0-style form fill (daytona, zeroentropy, kernel, browserbase).
This patch ports the lax-coercion behavior to TS at the validation boundary.
A `lenientInt(min)` helper preprocesses booleans into numbers before
delegating to `z.number().int().min(min)`. Same shape for `lenientNumber()`
covering `num_pages` / `pages` floats. Helper is applied only to
LLM-emitted index/element-index/page-count fields where pydantic's silent
coercion is documented behavior. Fields where bool->0/1 would be
semantically wrong (timeout, delay, max_results, coordinate_x/y) are left
strict to avoid masking a different model bug.
This is a graceful-degradation patch, not a fix to the model. bu-2-0 should
not emit booleans for integer fields. With this patch the agent now
progresses + relies on existing retry-feedback (PR webllm#34) to self-correct on
bad index choice rather than looping to max_failures.
Helpers + per-schema regression coverage in
`test/coerce-boolean-to-int.test.ts` (23 tests).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
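
A dependency-free sketch of the coercion idea in this commit. The commit says the real `lenientInt(min)` preprocesses booleans and then delegates to `z.number().int().min(min)`; the version below replaces the zod part with hand-written checks so it runs standalone, and `Parsed` and `coerceBoolean` are hypothetical names.

```typescript
// Hypothetical result shape, loosely modelled on a safeParse-style result.
type Parsed = { success: true; data: number } | { success: false; error: string };

// Coerce booleans the way pydantic's lax mode does (true -> 1, false -> 0);
// every other value passes through untouched.
function coerceBoolean(value: unknown): unknown {
  return typeof value === "boolean" ? Number(value) : value;
}

// Stand-in for the zod-based helper: coerce first, then validate strictly.
function lenientInt(min: number): (value: unknown) => Parsed {
  return (value) => {
    const v = coerceBoolean(value);
    if (typeof v !== "number" || !Number.isInteger(v)) {
      return { success: false, error: `expected integer, received ${typeof v}` };
    }
    if (v < min) {
      return { success: false, error: `expected integer >= ${min}, received ${v}` };
    }
    return { success: true, data: v };
  };
}

const parseIndex = lenientInt(0);
console.log(parseIndex(true)); // boolean is coerced before validation
console.log(parseIndex("1")); // strings are still rejected
```

Because coercion runs before the range check, `false` becomes `0` and is still rejected by a schema with `min` of 1, so the helper widens the accepted input types without loosening the numeric constraints.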
caffeinum
added a commit
to caffeinum/browser-use
that referenced
this pull request
Jun 13, 2026
Disclaimer: This is AI-generated, but I ran into this issue and tested the fix in prod
Parity with Python upstream
When an action's params fail zod validation, the TS port retries with the same prompt and no information about what went wrong. The model keeps making the same mistake until `max_failures` (default 3) trips and the agent force-emits `done(success=false)`. Python upstream catches `pydantic.ValidationError` and feeds the error message + sent params back to the LLM via the `last_result.error` → `message_manager` injection chain. This PR brings the TS implementation to parity.

References on `browser-use/browser-use` main:
- `browser_use/tools/registry/service.py:348-351`: wraps pydantic `ValidationError` as `f'Invalid parameters {params} for action {action_name}: {type(e)}: {e}'` (params echoed back + full error).
- `browser_use/agent/service.py:1959-1961`: re-raises `ValidationError` for whole-output validation; comment notes "Pydantic's validation errors are already descriptive".
- `browser_use/agent/message_manager/service.py:339-345`: reads each `ActionResult.error` from `last_result` and appends it to the `action_results` block of the next history item the LLM sees.

Note: Python's default `max_failures` was raised from 3 to 5 in browser-use/browser-use#4080.

Fix
At `src/agent/service.ts:5624-5637`, when `paramSchema.safeParse(rawParams)` fails, throw an `Error` whose message contains:
- `z.prettifyError(error)`: human-readable bullet list of zod issues, one per failing field, with the actual path.

The thrown error already flows through the existing `_handle_step_error` → `state.last_result` → `_prepare_state_messages` channel, so this PR adds no new injection mechanism. We just put useful text into an error that was previously a generic "validation failed".

Impact
In the canary-env docker repro, this is the change that actually let bu-2-0 self-correct after a single misemit instead of cascading to a 3-fail bail. Combined with #33, the residual error rate dropped to ~5/run, all of which recovered in 1 step.

`done(success=false, "failed multiple times")`
`done(success=true, {action:"run_complete"})`
`final_result: {}`

Tests
`test/zod-error-feedback.test.ts` covering the error message shape (prettified issues + echoed params + corrective hint).

Caveats
`message_manager/service.py:340-341`. The TS fix doesn't truncate, so we send slightly more diagnostic info to the LLM. If reviewers want strict parity, easy follow-up to add the truncation.

🤖 Generated with Claude Code