Fix/computer use aux vision routing#24070
Conversation
|
It seems that the tests have not passed |
Add tools/computer_use/vision_routing.py with
should_route_capture_to_aux_vision(provider, model, cfg) — a small
policy helper that decides whether a captured screenshot should be
returned as a multimodal envelope (main model has native vision) or
pre-analysed through the auxiliary.vision pipeline so the main model
only sees text.
The decision mirrors agent.image_routing.decide_image_input_mode for
user-attached images, so the capture path and the user-turn path agree
on what counts as an explicit aux vision override:
* provider/model/base_url under auxiliary.vision => explicit override
=> route through aux vision
* provider+model accepts multimodal tool results AND main model
reports supports_vision=True => keep multimodal envelope
* everything else (no tool-result image support, non-vision model,
metadata lookup failure) => fail closed and route through aux
No call sites are changed in this commit; the helper is added in
isolation so the routing decision can be unit-tested before it is
plumbed into _capture_response().
Add tests/tools/test_computer_use_vision_routing.py — 28 unit tests
that pin the contract of the new vision-routing helper introduced in
the previous commit:
* TestExplicitAuxVisionOverride (12 cases): mirror the
auxiliary.vision detection rules used by agent.image_routing so
the capture path and the user-attached-image path agree on what
counts as an explicit override (provider/model/base_url with
non-blank, non-'auto' values).
* TestRouteDecision (7 cases): pin the policy itself — explicit
override always wins, vision-capable + native-tool-result keeps
multimodal, everything else fails closed and routes to aux.
* TestLookupHelpers (5 cases): defensive paths for the models.dev /
tool-result-support lookups (blank inputs, exceptions, missing
caps).
* TestModuleSurface (4 cases): pin the public/__all__ surface and
keep internal helpers addressable so the integration test in the
next commit can monkeypatch them deterministically.
Run with:
scripts/run_tests.sh tests/tools/test_computer_use_vision_routing.py
…usResearch#24015) When the active main model has no vision capability — or when the user explicitly configured auxiliary.vision in config.yaml — sending the captured screenshot back to the main model in a multimodal tool-result envelope is the wrong move: it trips HTTP 404 / 400 at the provider boundary (e.g. 'No endpoints found that support image input') and the agent loop reports a hard tool failure for what should have been a simple capture. The reporter on NousResearch#24015 hit this with: model: default: tencent/hy3-preview # no vision support provider: openrouter auxiliary: vision: provider: openrouter model: google/gemini-2.5-flash # explicitly configured …and observed: computer_use(action='capture', mode='som') →⚠️ API call failed (attempt1/3): NotFoundError [HTTP 404] 🔌 Provider: openrouter Model: tencent/hy3-preview 📝 Error: HTTP 404: No endpoints found that support image input Fix: in tools/computer_use/tool.py::_capture_response, after a screenshot is captured (modes 'som' / 'vision'), consult the routing helper introduced earlier in this branch. When it says 'route to aux', materialise the PNG to $HERMES_HOME/cache/vision/, run vision_analyze on it (which honours auxiliary.vision via the standard async_call_llm task='vision' router), and return a text-only JSON tool result that embeds the analysis alongside the existing AX/SOM index. The main model never sees the pixels — it sees an actionable text description plus the same set-of-mark element index it normally uses. The two new helpers (_should_route_through_aux_vision, _route_capture_through_aux_vision) keep the policy and the IO separated so each can be tested in isolation. Both fail open: if the config import fails, if the aux call raises, or if the analysis is empty, we fall back to the existing multimodal envelope so the behaviour is at worst the pre-fix status quo. Temp screenshot files are cleaned up unconditionally in a finally block — even on aux call failure — to avoid leaving residue under cache/vision/. The end-to-end regression for NousResearch#24015 is added in the next commit.
…search#24015) Add tests/tools/test_computer_use_capture_routing.py — 13 integration tests that drive _capture_response end-to-end with deterministic stubs for the routing helper, _run_async, vision_analyze_tool, and get_hermes_dir, so the full code path is exercised without a live cua-driver, real auxiliary client, or network access. Coverage: * TestCaptureResponseDefaultPath (3 cases) - SOM PNG capture returns the legacy multimodal envelope when the routing helper says 'native' (image/png MIME). - Same path returns image/jpeg MIME for JPEG payloads (cua-driver can return either). - AX-only mode never even consults the routing helper because no PNG is present. * TestCaptureResponseRoutedToAuxVision (5 cases) - SOM capture with routing on returns a JSON string with the vision_analysis embedded, the AX/SOM index preserved, and NO image_url parts. Verifies the aux call receives a path under the configured cache and a prompt that grounds itself against the AX summary. - Temp screenshot file is unlinked after _capture_response returns, including when the aux call raises (the finally block runs). - Empty / malformed aux analysis falls back to the multimodal envelope so the user always gets *something* useful. * TestRoutingDecisionWiring (4 cases) - Explicit auxiliary.vision in config flips routing on regardless of main-model vision capability. - Vision-capable main + native tool-result support keeps multimodal. - Config load failure fails open (returns False, multimodal path continues to work). - Helper exception is swallowed and routes to legacy behaviour. * TestBugReproductionAnchor (1 case) - directly pins the NousResearch#24015 contract: when routing is on, the response must NEVER contain a 'data:image' or 'image_url' substring. That is exactly what tripped the reporter's HTTP 404 ('No endpoints found that support image input') on tencent/hy3-preview before the fix. Bug-reproduction proof: $ git checkout upstream/main -- tools/computer_use/tool.py $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py ============================== 13 failed in 1.29s ============================== $ # restore tool.py to this branch's HEAD $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py ============================== 13 passed in 1.04s ============================== Total branch coverage: 85 passed across test_computer_use.py, test_computer_use_vision_routing.py, test_computer_use_capture_routing.py
bfb44ff to
9365234
Compare
|
@RootMePLS Thanks for flagging! The earlier failures were stale-base issues. I've now rebased the branch onto current After rebase, all checks that depend on this branch's code pass:
The remaining failures are all pre-existing on
Bug-fix tests for this PRThe fix's own test suite passes locally and in CI lint:
Bug-reproduction proof still holds: revert Happy to address any code feedback — the failing checks are all environmental / pre-existing, not signal on the fix itself. |
|
@xxxigm thank you for working on this! |
What does this PR do?
Fixes #24015. The
computer_usetool'scaptureaction (mode='som'/mode='vision') used to always return a_multimodalenvelope containing the screenshot, which was then delivered to the active main session model as the tool result. When the active main model has no vision capability — or when the user explicitly configuredauxiliary.visioninconfig.yaml— that envelope tripped HTTP 404 / 400 at the provider boundary (e.g.No endpoints found that support image input) and the agent loop reported a hard tool failure.Reporter's repro:
This PR adds the same routing policy that
vision_analyzealready uses for user-attached images: when an explicitauxiliary.visionblock is set, OR the active main+provider can't carry images inside tool-result messages, OR the main model reports no vision capability, the captured PNG is materialised under$HERMES_HOME/cache/vision/, handed tovision_analyze_tool(which honoursauxiliary.visionvia the standardasync_call_llm(task='vision', ...)router), and the result is returned as a text-only JSON tool message that embeds the analysis alongside the existing AX/SOM index. The main model never sees the pixels — it sees an actionable text description plus the same set-of-mark element index it normally uses.The decision deliberately fails open: any failure (config import, helper exception, aux LLM crash, empty analysis) falls back to the legacy multimodal envelope. The temp screenshot file is unlinked unconditionally in a
finallyblock.Related Issue
Fixes #24015
Type of Change
Changes Made
Four commits, fix-test alternation, +992 lines, −0 lines:
fix(computer_use): add helper to decide capture vision routing— new moduletools/computer_use/vision_routing.py(+152 lines) withshould_route_capture_to_aux_vision(provider, model, cfg)plus the small lookup helpers it composes. Mirrorsagent.image_routing.decide_image_input_modeso capture-routing and user-attached-image routing agree on what counts as an explicit aux override.test(computer_use): cover capture vision-routing helper—tests/tools/test_computer_use_vision_routing.py(+260 lines, 28 unit tests) pinning the helper's contract: explicit override detection (12 cases), policy decision (7 cases), defensive lookups (5 cases), public surface (4 cases).fix(computer_use): route SOM/vision captures via auxiliary.vision (#24015)—tools/computer_use/tool.py(+149 lines): adds_should_route_through_aux_vision()(reads main provider/model + config, asks the helper, fails open) and_route_capture_through_aux_vision(cap, summary)(decodes the base64 PNG to$HERMES_HOME/cache/vision/, runsvision_analyze_toolviamodel_tools._run_async, returns a JSON text response). Wires both into_capture_response()ahead of the existing multimodal envelope branch.test(computer_use): end-to-end regression for capture routing (#24015)—tests/tools/test_computer_use_capture_routing.py(+431 lines, 13 integration tests) driving_capture_responseend-to-end with deterministic stubs: default native path (3), routed-to-aux path including temp-file cleanup on success/failure (5), routing-decision wiring with real config plumbing (4), and a bug-reproduction anchor that asserts the response never containsdata:image/image_urlwhen routing is on (1).How to Test
Check out this branch and ensure
.venvis set up.Run the fix's full test surface (helper + integration + existing computer_use suite):
Expected: 85 passed (44 pre-existing in
test_computer_use.py+ 28 new helper tests + 13 new integration tests).Bug-reproduction proof — without the production fix, every integration test fails:
Optional — manual repro on macOS with the reporter's config:
model.default: tencent/hy3-preview(or any non-vision OpenRouter model) andauxiliary.vision.{provider,model}: openrouter / google/gemini-2.5-flash.computer_use action=capture mode=som.No endpoints found that support image inputon the next agent turn.vision_analysis_routed_via: "auxiliary.vision"and the main model receives a description it can act on.Checklist
Code
fix(computer_use): ...,test(computer_use): ...)Documentation & Housekeeping
docs/, docstrings) — N/A; module-level docstrings in the two new files document the policy in detailcli-config.yaml.exampleif I added/changed config keys — N/A (no new config keys; reuses existingauxiliary.vision)CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/Acomputer_useis macOS-only viacua-driver; the fix path (text temp file under$HERMES_HOME/cache/vision/,Pathoperations) is portable_multimodalenvelope or JSON string, matching the documented contractScreenshots / Logs
Notes for reviewers
agent.image_routing._explicit_aux_vision_overrideso the capture path and the user-attached-image path stay in lockstep. If the project ever changes how it detects an explicitauxiliary.visionoverride, both call sites will need to move together — the lockstep is intentional, not a copy-paste mistake._capture_response()(and transitively_maybe_follow_capture()via_capture_response()). Thevisionandsommodes both go through this path;axmode is unchanged because no PNG is ever produced.model_tools._run_asyncis the project-standard sync→async bridge already used by every other handler that needs to call an async helper from a synchandle_*registration. No new bridge code is introduced.