fix(agent): feed prettified zod issues + sent params back to LLM on retry by caffeinum · Pull Request #34 · webllm/browser-use

caffeinum · 2026-05-06T01:32:27Z

Disclaimer: This is AI-generated, but I ran into this issue and tested the fix in prod

Parity with Python upstream

When an action's params fail zod validation, the TS port retries with the same prompt and no information about what went wrong. The model keeps making the same mistake until max_failures (default 3) trips and the agent force-emits done(success=false). Python upstream catches pydantic.ValidationError and feeds the error message + sent params back to the LLM via the last_result.error → message_manager injection chain. This PR brings the TS implementation to parity.

References on browser-use/browser-use main:

browser_use/tools/registry/service.py:348-351 — wraps pydantic ValidationError as f'Invalid parameters {params} for action {action_name}: {type(e)}: {e}' (params echoed back + full error).
browser_use/agent/service.py:1959-1961 — re-raises ValidationError for whole-output validation; comment notes "Pydantic's validation errors are already descriptive".
browser_use/agent/message_manager/service.py:339-345 — reads each ActionResult.error from last_result and appends it to the action_results block of the next history item the LLM sees.

Note: Python's default max_failures was raised from 3 to 5 in browser-use/browser-use#4080.

Fix

At src/agent/service.ts:5624-5637, when paramSchema.safeParse(rawParams) fails, throw an Error whose message contains:

z.prettifyError(error) — human-readable bullet list of zod issues, one per failing field, with the actual path.
The sent params JSON echoed back, so the model sees exactly what it just emitted.
An explicit corrective hint telling the model to re-emit the action with the corrected types.
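As a rough illustration of the three parts listed above, here is a minimal, dependency-free sketch. `prettifyIssues` is a hypothetical stand-in for `z.prettifyError` so the snippet runs without zod installed, and the real function's exact line layout may differ. The `Issue` shape, the function names, and the `scroll` action name are assumptions made for this example, not code from the PR.

```typescript
// Hypothetical issue shape, modelled on the zod issue quoted in the commit message.
interface Issue {
  path: (string | number)[];
  message: string;
}

// Dependency-free stand-in for z.prettifyError: one "✖ message → at path" line per issue.
function prettifyIssues(issues: Issue[]): string {
  return issues
    .map((issue) =>
      issue.path.length > 0
        ? `✖ ${issue.message} → at ${issue.path.join(".")}`
        : `✖ ${issue.message}`,
    )
    .join("\n");
}

// Assemble the three parts: prettified issues, echoed params, corrective hint.
function buildValidationErrorMessage(
  actionName: string,
  rawParams: unknown,
  issues: Issue[],
): string {
  return [
    `Schema validation failed for action "${actionName}":`,
    prettifyIssues(issues),
    `Sent params: ${JSON.stringify(rawParams)}`,
    "Please retry with parameters matching the action's schema exactly.",
  ].join("\n");
}

// "scroll" is a hypothetical action name used only for this example.
const exampleMessage = buildValidationErrorMessage(
  "scroll",
  { num_pages: true },
  [{ path: ["num_pages"], message: "Invalid input: expected number, received boolean" }],
);
console.log(exampleMessage);
```

Echoing the params as compact JSON keeps the message bounded by the size of the action's own arguments.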

The thrown error already flows through the existing _handle_step_error → state.last_result → _prepare_state_messages channel, so this PR adds no new injection mechanism. We just put useful text into an error that was previously a generic "validation failed".
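The channel described above can be pictured with a simplified sketch. The types and function bodies below are hypothetical stand-ins that only mirror the roles of `_handle_step_error` and `_prepare_state_messages`; the real methods in src/agent/service.ts do considerably more.

```typescript
// Simplified, hypothetical stand-ins for ActionResult and the agent state.
interface ActionResultSketch {
  error?: string;
}

interface AgentStateSketch {
  lastResult: ActionResultSketch[];
  consecutiveFailures: number;
}

// Plays the role of _handle_step_error: record the thrown message for the next step.
function handleStepError(state: AgentStateSketch, err: unknown): void {
  const message = err instanceof Error ? err.message : String(err);
  state.lastResult = [{ error: message }];
  state.consecutiveFailures += 1;
}

// Plays the role of _prepare_state_messages: surface recorded errors to the model.
function prepareStateMessages(state: AgentStateSketch, task: string): string[] {
  const messages = [`Task: ${task}`];
  for (const result of state.lastResult) {
    if (result.error !== undefined) {
      messages.push(`Previous action error:\n${result.error}`);
    }
  }
  return messages;
}

const sketchState: AgentStateSketch = { lastResult: [], consecutiveFailures: 0 };
try {
  // Stands in for the error thrown when safeParse fails.
  throw new Error("Schema validation failed: expected number, received boolean at num_pages");
} catch (err) {
  handleStepError(sketchState, err);
}
const nextPrompt = prepareStateMessages(sketchState, "fill in the login form");
```

Because the error text is the only thing that changes, improving the message improves what the model sees on the next step without touching this plumbing.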

Impact

In the canary-env docker repro, this is the change that actually let bu-2-0 self-correct after a single misemit instead of cascading to a 3-fail bail. Combined with #33, the residual error rate dropped to ~5/run, all of which recovered in 1 step.

| | before fixes | after fixes |
| --- | --- | --- |
| steps | 5 | 20 |
| BA final | `done(success=false, "failed multiple times")` | `done(success=true, {action:"run_complete"})` |
| orchestrator | FATAL on `final_result: {}` | clean run_complete |
| zod errors | 20+ cascading | 5, each self-corrected in 1 step |
| eval | stuck running | success, reward 0.87, score C 63.7/100 |

Tests

Added test/zod-error-feedback.test.ts covering the error message shape (prettified issues + echoed params + corrective hint).
Full suite: 968 pass / 0 fail.

Caveats

Python truncates errors at 200 chars in message_manager/service.py:340-341. The TS fix doesn't truncate, so we send slightly more diagnostic info to the LLM. If reviewers want strict parity, adding the truncation is an easy follow-up.
Error message size grows slightly when params are large, since we echo them back. Bounded by the action params, not by the conversation.
Independent of #33 (fix(registry): render action params as JSON Schema instead of zod _def dump), but most useful in combination: on its own this PR cuts the cascade, but it doesn't fix the root confusion.
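If the truncation follow-up is wanted, it could look roughly like the sketch below. Only the 200-character limit comes from the caveat above; the helper name, the head-only cut, and the `...` suffix are assumptions, and Python's exact truncation strategy may differ.

```typescript
// Hypothetical helper: cap an error string at maxChars, marking the cut with "...".
function truncateError(message: string, maxChars = 200): string {
  if (message.length <= maxChars) {
    return message;
  }
  return message.slice(0, maxChars) + "...";
}

const longError = "x".repeat(500);
const cappedError = truncateError(longError);
console.log(cappedError.length);
```

A head-only cut would keep the prettified issues, which come first in the message, and drop the tail of large echoed params.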

🤖 Generated with Claude Code

…etry

When `_validateAndNormalizeActions` rejected an action's params via `actionInfo.paramSchema.safeParse(rawParams)`, the thrown error message was the raw `paramsResult.error.message`, i.e. zod v4's default JSON dump of the `issues` array (`[{"expected":"number","code":"invalid_type", "path":["num_pages"],"message":"Invalid input: expected number, received boolean"}]`). This noisy blob did flow into `state.last_result` and into the next `create_state_messages` turn, but it was hard for the model to parse and gave no corrective hint, so the model retried with the same mistake until `max_failures=3` tripped.

Use `z.prettifyError(paramsResult.error)` (zod v4 native) to render issues as readable lines (e.g. `✖ Invalid input: expected number, received boolean → at num_pages`), include the offending params verbatim so the model can diff against the schema, and tag the message with an explicit `Schema validation failed` prefix plus a `Please retry with parameters matching the action's schema exactly` instruction.

The existing pipeline does the rest: thrown Error → `_handle_step_error` → `state.last_result = [ActionResult({error: ...})]` → next step's `create_state_messages` injects it into the LLM context. No new injection mechanism, just a better-shaped payload going through the existing one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d throw on exhaust

Replaces the prior lenientBool default-false approach which silently masked bu-2-0's tendency to emit undefined for `is_correct` and `verdict` boolean fields. Defaulting to false hid the model bug from operators and left the orchestrator unable to distinguish "model emitted false" from "model emitted nothing".

Now: when SimpleJudgeSchema or JudgeSchema fails parse, we send the prettified zod errors back to the LLM and retry up to 2 times, matching PR webllm#34's pattern for action-emission retries. If retries exhaust, surface the failure on the run's final ActionResult so harbor's failure_reason picks it up: `_run_simple_judge` marks the run as failed with a `[Judge schema invalid: ...]` note; `_judge_trace` synthesizes a verdict=false judgement with the schema error in `failure_reason`. Adds JudgeSchemaInvalidError (src/exceptions.ts) for the internal throw.

A first-attempt shape check (any judge-related key present) preserves the prior graceful-skip path when the LLM returns a non-judge JSON shape entirely (e.g. an agent-step JSON in mocked tests), so we don't regress component tests that wire one mock LLM for both agent and judge calls.

This pairs strict zod with feedback-driven self-correction (per reference_zod_pydantic_parity.md) instead of papering over the model bug with a default.

Adds test/agent-judge-schema-retry.test.ts covering: (1) bu-2-0 missing-is_correct retry-then-recover, (2) retries exhaust → run marked failed, (3) network errors stay swallowed, (4) JudgeSchema verdict-missing exhaustion path, (5) verdict self-correction on retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
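The retry-with-feedback pattern described in this commit message can be sketched as a small loop. This is a simplified illustration under stated assumptions: `parseWithFeedback`, `ParseResult`, `parseSimpleJudge`, and the fake model are hypothetical, the loop is kept synchronous for brevity (a real LLM call would be awaited), and only the `JudgeSchemaInvalidError` name and the limit of 2 retries come from the text above.

```typescript
// Named after the error class the commit message mentions; this body is a sketch.
class JudgeSchemaInvalidError extends Error {
  constructor(detail: string) {
    super(`Judge schema invalid: ${detail}`);
    this.name = "JudgeSchemaInvalidError";
  }
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Ask once, then retry up to maxRetries times, passing the previous
// validation error back as feedback. Throws when every attempt fails.
function parseWithFeedback<T>(
  ask: (feedback: string | null) => unknown,
  parse: (raw: unknown) => ParseResult<T>,
  maxRetries = 2,
): T {
  let feedback: string | null = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = parse(ask(feedback));
    if (result.ok) {
      return result.value;
    }
    feedback = result.error;
  }
  throw new JudgeSchemaInvalidError(feedback ?? "unknown error");
}

// Hypothetical strict check: is_correct must be a real boolean, never defaulted.
function parseSimpleJudge(raw: unknown): ParseResult<{ is_correct: boolean }> {
  if (typeof raw === "object" && raw !== null) {
    const value = (raw as Record<string, unknown>).is_correct;
    if (typeof value === "boolean") {
      return { ok: true, value: { is_correct: value } };
    }
  }
  return { ok: false, error: "expected boolean, received undefined at is_correct" };
}

// Fake model: omits is_correct first, then corrects itself once it sees feedback.
const feedbackSeen: (string | null)[] = [];
const fakeJudge = (feedback: string | null): unknown => {
  feedbackSeen.push(feedback);
  return feedback === null ? {} : { is_correct: true };
};
const judgeVerdict = parseWithFeedback(fakeJudge, parseSimpleJudge);
```

Throwing on exhaustion, rather than defaulting to false, is what lets the caller distinguish "model emitted false" from "model emitted nothing".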

unadlib · 2026-05-08T17:45:40Z

browser-use v0.6.1 has been released.

…arity)

Empirical observation: bu-2-0 (browser-use cloud LLM) occasionally emits `{input_text: {index: true, text: "..."}}` (boolean) where the schema expects an integer. Python upstream silently coerces `True -> 1` / `False -> 0` via pydantic's default lax mode. zod (TS) hard-rejects with `expected number, received boolean`, the agent retries with the same broken output, and bails at max_failures. This was the observed bail mode in production for every auth0-style form fill (daytona, zeroentropy, kernel, browserbase).

This patch ports the lax-coercion behavior to TS at the validation boundary. A `lenientInt(min)` helper preprocesses booleans into numbers before delegating to `z.number().int().min(min)`. Same shape for `lenientNumber()` covering `num_pages` / `pages` floats.

Helper is applied only to LLM-emitted index/element-index/page-count fields where pydantic's silent coercion is documented behavior. Fields where bool->0/1 would be semantically wrong (timeout, delay, max_results, coordinate_x/y) are left strict to avoid masking a different model bug.

This is a graceful-degradation patch, not a fix to the model. bu-2-0 should not emit booleans for integer fields. With this patch the agent now progresses and relies on the existing retry feedback (PR webllm#34) to self-correct on a bad index choice rather than looping to max_failures.

Helpers + per-schema regression coverage in `test/coerce-boolean-to-int.test.ts` (23 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
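The coercion step can be illustrated without zod. In the sketch below, `parseLenientInt` is a hypothetical, dependency-free stand-in for the `lenientInt(min)` helper: per the commit message, the real helper preprocesses the value and then delegates to `z.number().int().min(min)`, whereas this version hand-rolls the integer check so it runs on its own.

```typescript
// Pre-processing step: map booleans to 1/0 and leave every other value untouched.
function coerceBooleanToNumber(value: unknown): unknown {
  return typeof value === "boolean" ? (value ? 1 : 0) : value;
}

// Dependency-free stand-in for lenientInt(min): coerce, then enforce "integer >= min".
function parseLenientInt(value: unknown, min: number): number {
  const coerced = coerceBooleanToNumber(value);
  if (typeof coerced !== "number" || !Number.isInteger(coerced) || coerced < min) {
    throw new Error(
      `Invalid input: expected integer >= ${min}, received ${JSON.stringify(value)}`,
    );
  }
  return coerced;
}

console.log(parseLenientInt(true, 0));
console.log(parseLenientInt(7, 1));
```

Only booleans are coerced here; strings such as `"3"` are still rejected, which keeps the leniency as narrow as the patch describes.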

caffeinum mentioned this pull request May 6, 2026

fix(registry): render action params as JSON Schema instead of zod _def dump #33

Merged

caffeinum marked this pull request as ready for review May 6, 2026 01:44

caffeinum requested a review from unadlib as a code owner May 6, 2026 01:44

unadlib approved these changes May 7, 2026

unadlib merged commit 6bb2692 into webllm:main May 7, 2026

caffeinum mentioned this pull request May 7, 2026

fix(agent): coerce missing booleans in judge schemas (pydantic parity) #36

Closed

caffeinum mentioned this pull request May 7, 2026

fix(agent): retry judge schema validation with prettified errors #37

Closed

caffeinum deleted the fix/zod-error-feedback branch May 19, 2026 18:23

unadlib merged 1 commit into webllm:main from caffeinum:fix/zod-error-feedback
