fix(agent): feed prettified zod issues + sent params back to LLM on retry #34

Merged
unadlib merged 1 commit into webllm:main from caffeinum:fix/zod-error-feedback
May 7, 2026

@caffeinum caffeinum commented May 6, 2026

Contributor

Disclaimer: This is AI-generated, but I ran into this issue and tested the fix in prod

Parity with Python upstream

When an action's params fail zod validation, the TS port retries with the same prompt and no information about what went wrong. The model keeps making the same mistake until max_failures (default 3) trips and the agent force-emits done(success=false). Python upstream catches pydantic.ValidationError and feeds the error message + sent params back to the LLM via the last_result.error → message_manager injection chain. This PR brings the TS implementation to parity.

References on browser-use/browser-use main:

  • browser_use/tools/registry/service.py:348-351 — wraps pydantic ValidationError as f'Invalid parameters {params} for action {action_name}: {type(e)}: {e}' (params echoed back + full error).
  • browser_use/agent/service.py:1959-1961 — re-raises ValidationError for whole-output validation; comment notes "Pydantic's validation errors are already descriptive".
  • browser_use/agent/message_manager/service.py:339-345 — reads each ActionResult.error from last_result and appends it to the action_results block of the next history item the LLM sees.

Note: Python's default max_failures was raised from 3 to 5 in browser-use/browser-use#4080.

Fix

At src/agent/service.ts:5624-5637, when paramSchema.safeParse(rawParams) fails, throw an Error whose message contains:

  1. z.prettifyError(error) — human-readable bullet list of zod issues, one per failing field, with the actual path.
  2. The sent params JSON echoed back, so the model sees exactly what it just emitted.
  3. An explicit corrective hint telling the model to re-emit the action with the corrected types.

The thrown error already flows through the existing _handle_step_error → state.last_result → _prepare_state_messages channel, so this PR adds no new injection mechanism. We just put useful text into an error that was previously a generic "validation failed".
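
As a rough illustration of the three-part message described above, here is a minimal, dependency-free sketch. It is not the code at src/agent/service.ts: `Issue`, `formatIssues`, and `buildValidationErrorMessage` are hypothetical names, `formatIssues` is a hand-rolled stand-in that only loosely imitates `z.prettifyError`, and the `scroll` action name is just an example.

```typescript
// Minimal shape of a zod-style issue; real zod issues carry more fields.
interface Issue {
  path: (string | number)[];
  message: string;
}

// Hand-rolled stand-in for z.prettifyError: one line per issue, with its path.
function formatIssues(issues: Issue[]): string {
  return issues
    .map((issue) => {
      const where = issue.path.length > 0 ? ` → at ${issue.path.join(".")}` : "";
      return `✖ ${issue.message}${where}`;
    })
    .join("\n");
}

// Hypothetical helper: readable issues, echoed params, corrective hint.
function buildValidationErrorMessage(
  actionName: string,
  rawParams: unknown,
  issues: Issue[],
): string {
  return [
    `Schema validation failed for action "${actionName}":`,
    formatIssues(issues),
    `Params sent: ${JSON.stringify(rawParams)}`,
    "Please retry with parameters matching the action's schema exactly.",
  ].join("\n");
}

// Example: the model emitted a boolean where a number was expected.
const exampleMessage = buildValidationErrorMessage(
  "scroll",
  { num_pages: true },
  [{ path: ["num_pages"], message: "Invalid input: expected number, received boolean" }],
);
```

Putting the hint last and the issues first means the model reads the concrete failure before the instruction; the real implementation may order or word these parts differently.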

Impact

In the canary-env docker repro, this is the change that actually let bu-2-0 self-correct after a single misemit instead of cascading to a 3-fail bail. Combined with #33, the residual error rate dropped to ~5/run, all of which recovered in 1 step.

| | before fixes | after fixes |
|---|---|---|
| steps | 5 | 20 |
| BA final | done(success=false, "failed multiple times") | done(success=true, {action:"run_complete"}) |
| orchestrator | FATAL on final_result: {} | clean run_complete |
| zod errors | 20+ cascading | 5, each self-corrected in 1 step |
| eval | stuck running | success, reward 0.87, score C 63.7/100 |

Tests

  • Added test/zod-error-feedback.test.ts covering the error message shape (prettified issues + echoed params + corrective hint).
  • Full suite: 968 pass / 0 fail.

Caveats

  • Python truncates errors at 200 chars in message_manager/service.py:340-341. The TS fix doesn't truncate, so we send slightly more diagnostic info to the LLM. If reviewers want strict parity, adding the truncation is an easy follow-up.
  • Error message size grows slightly when params are large, since we echo them back. Bounded by the action params, not by the conversation.
  • Independent of #33 (fix(registry): render action params as JSON Schema instead of zod _def dump) but most useful in combination: on its own this PR cuts the cascade, but it doesn't fix the root confusion.
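
For reference, the truncation follow-up mentioned in the first caveat could look roughly like the sketch below. `truncateError` is a hypothetical helper, and the trailing `...` marker is an assumption; how exactly upstream cuts the string is not shown in this PR.

```typescript
// Hypothetical helper: cap an error message at maxChars, marking the cut.
function truncateError(message: string, maxChars = 200): string {
  return message.length > maxChars
    ? `${message.slice(0, maxChars)}...`
    : message;
}

const shortError = truncateError("Schema validation failed");
const longError = truncateError("x".repeat(250));
```

One thing to weigh before adopting it: a hard cap would likely cut off whatever sits at the end of the message, so a long echoed-params blob could push the corrective hint out of view.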

🤖 Generated with Claude Code

…etry

When `_validateAndNormalizeActions` rejected an action's params via
`actionInfo.paramSchema.safeParse(rawParams)`, the thrown error message
was the raw `paramsResult.error.message` — i.e. zod v4's default JSON
dump of the `issues` array (`[{"expected":"number","code":"invalid_type",
"path":["num_pages"],"message":"Invalid input: expected number, received
boolean"}]`). This noisy blob did flow into `state.last_result` and into
the next `create_state_messages` turn, but it was hard for the model to
parse and gave no corrective hint, so the model retried with the same
mistake until `max_failures=3` tripped.

Use `z.prettifyError(paramsResult.error)` (zod v4 native) to render
issues as readable lines (e.g. `✖ Invalid input: expected number,
received boolean → at num_pages`), include the offending params verbatim
so the model can diff against the schema, and tag the message with an
explicit `Schema validation failed` prefix plus a `Please retry with
parameters matching the action's schema exactly` instruction.

The existing pipeline does the rest: thrown Error → `_handle_step_error`
→ `state.last_result = [ActionResult({error: ...})]` → next step's
`create_state_messages` injects it into the LLM context. No new
injection mechanism, just a better-shaped payload going through the
existing one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
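
The flow described in the commit message above (thrown Error, captured into the last result, injected into the next turn) can be sketched roughly as follows. This is a simplified model, not the repository's code: `AgentState`, `runStep`, and `buildStateMessage` are hypothetical stand-ins for `_handle_step_error`, `state.last_result`, and `create_state_messages`.

```typescript
interface ActionResult {
  error?: string;
}

interface AgentState {
  lastResult: ActionResult[];
  consecutiveFailures: number;
}

// Stand-in for create_state_messages: builds the text the model sees next turn.
function buildStateMessage(task: string, state: AgentState): string {
  const errors = state.lastResult
    .filter((result) => result.error !== undefined)
    .map((result) => `Error: ${result.error}`);
  return [task, ...errors].join("\n");
}

// Stand-in for _handle_step_error: a thrown error is stored, not propagated.
function runStep(state: AgentState, step: () => void): void {
  try {
    step();
    state.lastResult = [];
    state.consecutiveFailures = 0;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    state.lastResult = [{ error: message }];
    state.consecutiveFailures += 1;
  }
}

// A failed validation on one step shows up in the prompt for the next step.
const agentState: AgentState = { lastResult: [], consecutiveFailures: 0 };
runStep(agentState, () => {
  throw new Error("Schema validation failed: expected number, received boolean at num_pages");
});
const nextPrompt = buildStateMessage("Scroll down two pages", agentState);
```

Because the channel only carries a string, the quality of the retry depends entirely on what the thrown message says, which is why this change touches the message and nothing else.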
@caffeinum caffeinum marked this pull request as ready for review May 6, 2026 01:44
@caffeinum caffeinum requested a review from unadlib as a code owner May 6, 2026 01:44
@unadlib unadlib merged commit 6bb2692 into webllm:main May 7, 2026
caffeinum added a commit to caffeinum/browser-use that referenced this pull request May 7, 2026
…d throw on exhaust

Replaces the prior lenientBool default-false approach which silently
masked bu-2-0's tendency to emit undefined for `is_correct` and
`verdict` boolean fields. Defaulting to false hid the model bug from
operators and left the orchestrator unable to distinguish "model
emitted false" from "model emitted nothing".

Now: when SimpleJudgeSchema or JudgeSchema fails parse, we send the
prettified zod errors back to the LLM and retry up to 2 times,
matching PR webllm#34's pattern for action-emission retries. If retries
exhaust, surface the failure on the run's final ActionResult so
harbor's failure_reason picks it up — `_run_simple_judge` marks the
run as failed with a `[Judge schema invalid: ...]` note;
`_judge_trace` synthesizes a verdict=false judgement with the
schema error in `failure_reason`. Adds JudgeSchemaInvalidError
(src/exceptions.ts) for the internal throw.

A first-attempt shape check (any judge-related key present) preserves
the prior graceful-skip path when the LLM returns a non-judge JSON
shape entirely (e.g. an agent-step JSON in mocked tests), so we don't
regress component tests that wire one mock LLM for both agent and
judge calls.

This pairs strict zod with feedback-driven self-correction (per
reference_zod_pydantic_parity.md) instead of papering over the model
bug with a default. Adds test/agent-judge-schema-retry.test.ts
covering: (1) bu-2-0 missing-is_correct retry-then-recover, (2)
retries exhaust → run marked failed, (3) network errors stay
swallowed, (4) JudgeSchema verdict-missing exhaustion path, (5)
verdict self-correction on retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
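
A minimal sketch of the retry-then-throw pattern that commit message describes, assuming a generic `ask`/`parse` pair. `callWithSchemaRetry`, `ParseResult`, `parseSimpleJudge`, and the fake model are hypothetical; the `JudgeSchemaInvalidError` here only borrows the name from the commit and is not the class in src/exceptions.ts.

```typescript
// Illustrative error type; shares only its name with the one in the commit.
class JudgeSchemaInvalidError extends Error {
  constructor(detail: string) {
    super(`Judge schema invalid: ${detail}`);
    this.name = "JudgeSchemaInvalidError";
  }
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Ask, validate, and on failure re-ask with the validation error as feedback.
// One initial attempt plus up to maxRetries retries, then throw.
async function callWithSchemaRetry<T>(
  ask: (feedback: string | null) => Promise<unknown>,
  parse: (raw: unknown) => ParseResult<T>,
  maxRetries = 2,
): Promise<T> {
  let feedback: string | null = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await ask(feedback);
    const result = parse(raw);
    if (result.ok) return result.value;
    feedback = result.error;
  }
  throw new JudgeSchemaInvalidError(feedback ?? "unknown");
}

// Hand-written check standing in for a zod schema with a boolean is_correct.
function parseSimpleJudge(raw: unknown): ParseResult<{ is_correct: boolean }> {
  const candidate = raw as { is_correct?: unknown } | null;
  if (typeof candidate === "object" && candidate !== null && typeof candidate.is_correct === "boolean") {
    return { ok: true, value: { is_correct: candidate.is_correct } };
  }
  return { ok: false, error: "✖ Invalid input: expected boolean, received undefined → at is_correct" };
}

// Fake model: omits is_correct until it has received feedback.
let judgeCalls = 0;
const fakeJudge = async (feedback: string | null): Promise<unknown> => {
  judgeCalls += 1;
  return feedback === null ? {} : { is_correct: true };
};

const verdictPromise = callWithSchemaRetry(fakeJudge, parseSimpleJudge);
```

Throwing on exhaustion, rather than defaulting to `false`, keeps "model emitted false" distinguishable from "model emitted nothing" for whoever catches the error.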
unadlib commented May 8, 2026

Member

browser-use v0.6.1 has been released.

@caffeinum caffeinum deleted the fix/zod-error-feedback branch May 19, 2026 18:23
caffeinum added a commit to caffeinum/browser-use that referenced this pull request May 19, 2026
…arity)

Empirical observation: bu-2-0 (browser-use cloud LLM) occasionally emits
`{input_text: {index: true, text: "..."}}` (boolean) where the schema expects
an integer. Python upstream silently coerces `True -> 1` / `False -> 0` via
pydantic's default lax mode. zod (TS) hard-rejects with
`expected number, received boolean`, the agent retries with the same broken
output, and bails at max_failures. This was the observed bail mode in
production for every auth0-style form fill (daytona, zeroentropy, kernel,
browserbase).

This patch ports the lax-coercion behavior to TS at the validation boundary.
A `lenientInt(min)` helper preprocesses booleans into numbers before
delegating to `z.number().int().min(min)`. Same shape for `lenientNumber()`
covering `num_pages` / `pages` floats. Helper is applied only to
LLM-emitted index/element-index/page-count fields where pydantic's silent
coercion is documented behavior. Fields where bool->0/1 would be
semantically wrong (timeout, delay, max_results, coordinate_x/y) are left
strict to avoid masking a different model bug.

This is a graceful-degradation patch, not a fix to the model. bu-2-0 should
not emit booleans for integer fields. With this patch the agent now
progresses + relies on existing retry-feedback (PR webllm#34) to self-correct on
bad index choice rather than looping to max_failures.

Helpers + per-schema regression coverage in
`test/coerce-boolean-to-int.test.ts` (23 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
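
To make the coercion described in that commit message concrete, here is a dependency-free sketch of the idea. It is not the commit's `lenientInt` helper: `coerceBooleanToNumber` and `parseLenientInt` are hypothetical names, and the hand-written checks stand in for a preprocess step followed by `z.number().int().min(min)`.

```typescript
// Booleans become 1 or 0; every other value passes through untouched.
function coerceBooleanToNumber(value: unknown): unknown {
  return typeof value === "boolean" ? Number(value) : value;
}

type IntResult = { ok: true; value: number } | { ok: false; error: string };

// Coerce first, then apply strict integer and minimum checks.
function parseLenientInt(value: unknown, min: number): IntResult {
  const coerced = coerceBooleanToNumber(value);
  if (typeof coerced !== "number") {
    return { ok: false, error: `expected number, received ${typeof coerced}` };
  }
  if (!Number.isInteger(coerced)) {
    return { ok: false, error: `expected integer, received ${coerced}` };
  }
  if (coerced < min) {
    return { ok: false, error: `expected integer >= ${min}, received ${coerced}` };
  }
  return { ok: true, value: coerced };
}

const fromTrue = parseLenientInt(true, 0); // boolean is coerced, then accepted
const fromString = parseLenientInt("3", 0); // strings are not coerced
const belowMin = parseLenientInt(false, 1); // coerced to 0, then rejected by min
```

Only booleans are coerced here; strings and other types still fail, which keeps the leniency limited to the one misemission the commit message reports.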
caffeinum added a commit to caffeinum/browser-use that referenced this pull request Jun 13, 2026
…arity)
