Fix/computer use aux vision routing by xxxigm · Pull Request #24070 · NousResearch/hermes-agent

xxxigm · 2026-05-12T00:30:10Z

What does this PR do?

Fixes #24015. The computer_use tool's capture action (mode='som' / mode='vision') used to always return a _multimodal envelope containing the screenshot, which was then delivered to the active main session model as the tool result. When the active main model has no vision capability, or when the user explicitly configured auxiliary.vision in config.yaml, that envelope triggered an HTTP 404 / 400 at the provider boundary (e.g. "No endpoints found that support image input") and the agent loop reported a hard tool failure.

Reporter's repro:

model:
  default: tencent/hy3-preview      # no vision support
  provider: openrouter
auxiliary:
  vision:
    provider: openrouter
    model: google/gemini-2.5-flash  # explicitly configured, never used

computer_use(action='capture', mode='som')
→ ⚠️ API call failed (attempt1/3): NotFoundError [HTTP 404]
   🔌 Provider: openrouter  Model: tencent/hy3-preview
   📝 Error: HTTP 404: No endpoints found that support image input

This PR adds the same routing policy that vision_analyze already uses for user-attached images: when an explicit auxiliary.vision block is set, OR the active main+provider can't carry images inside tool-result messages, OR the main model reports no vision capability, the captured PNG is materialised under $HERMES_HOME/cache/vision/, handed to vision_analyze_tool (which honours auxiliary.vision via the standard async_call_llm(task='vision', ...) router), and the result is returned as a text-only JSON tool message that embeds the analysis alongside the existing AX/SOM index. The main model never sees the pixels — it sees an actionable text description plus the same set-of-mark element index it normally uses.
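The three-way policy above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/computer_use/vision_routing.py: the `CaptureContext` dataclass and `route_capture_to_aux` are hypothetical names, and the three booleans stand in for whatever lookups the real helper performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureContext:
    """Hypothetical inputs to the routing decision (names are assumptions)."""
    explicit_aux_override: bool        # auxiliary.vision is explicitly configured
    tool_results_accept_images: bool   # main provider+model can carry images in tool results
    main_model_has_vision: bool        # main model reports vision capability


def route_capture_to_aux(ctx: CaptureContext) -> bool:
    """Return True when the screenshot should be pre-analysed by the aux vision model."""
    if ctx.explicit_aux_override:
        return True   # the user asked for aux vision, so honour it
    if not ctx.tool_results_accept_images:
        return True   # an image in a tool result would be rejected by the provider
    if not ctx.main_model_has_vision:
        return True   # the main model cannot read pixels
    return False      # native path: keep the multimodal envelope
```

Under these assumptions the reporter's setup (non-vision main model plus an explicit auxiliary.vision block) routes to aux on the first check, and only a vision-capable main model with no override keeps the multimodal envelope.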

The decision deliberately fails open: any failure (config import, helper exception, aux LLM crash, empty analysis) falls back to the legacy multimodal envelope. The temp screenshot file is unlinked unconditionally in a finally block.
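A minimal sketch of the fail-open and cleanup behaviour described above. The function name, its parameters, and the `analyze` / `fallback` callables are assumptions for illustration; the real logic lives in `_route_capture_through_aux_vision` and may differ in detail.

```python
import base64
import os
import tempfile
from pathlib import Path
from typing import Callable


def analyze_capture_or_fallback(
    png_b64: str,
    cache_dir: Path,
    analyze: Callable[[Path], str],   # stand-in for the aux vision call
    fallback: Callable[[], str],      # stand-in for the legacy multimodal envelope
) -> str:
    """Write the screenshot to a temp file, analyse it, and fail open on any problem."""
    tmp_path = None
    try:
        cache_dir.mkdir(parents=True, exist_ok=True)
        fd, name = tempfile.mkstemp(suffix=".png", dir=cache_dir)
        tmp_path = Path(name)
        with os.fdopen(fd, "wb") as fh:
            fh.write(base64.b64decode(png_b64))
        analysis = analyze(tmp_path)
        if not analysis or not analysis.strip():
            return fallback()   # an empty analysis counts as a failure
        return analysis
    except Exception:
        return fallback()       # fail open: behave like the pre-fix code path
    finally:
        # Runs on success and on failure, so no screenshot is left in the cache.
        if tmp_path is not None:
            tmp_path.unlink(missing_ok=True)
```

Keeping the unlink in `finally` means the cache directory stays empty whether the aux call succeeds, raises, or returns nothing useful.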

Related Issue

Fixes #24015

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Four commits, fix-test alternation, +992 lines, −0 lines:

fix(computer_use): add helper to decide capture vision routing — new module tools/computer_use/vision_routing.py (+152 lines) with should_route_capture_to_aux_vision(provider, model, cfg) plus the small lookup helpers it composes. Mirrors agent.image_routing.decide_image_input_mode so capture-routing and user-attached-image routing agree on what counts as an explicit aux override.
test(computer_use): cover capture vision-routing helper — tests/tools/test_computer_use_vision_routing.py (+260 lines, 28 unit tests) pinning the helper's contract: explicit override detection (12 cases), policy decision (7 cases), defensive lookups (5 cases), public surface (4 cases).
fix(computer_use): route SOM/vision captures via auxiliary.vision (#24015) — tools/computer_use/tool.py (+149 lines): adds _should_route_through_aux_vision() (reads main provider/model + config, asks the helper, fails open) and _route_capture_through_aux_vision(cap, summary) (decodes the base64 PNG to $HERMES_HOME/cache/vision/, runs vision_analyze_tool via model_tools._run_async, returns a JSON text response). Wires both into _capture_response() ahead of the existing multimodal envelope branch.
test(computer_use): end-to-end regression for capture routing (#24015) — tests/tools/test_computer_use_capture_routing.py (+431 lines, 13 integration tests) driving _capture_response end-to-end with deterministic stubs: default native path (3), routed-to-aux path including temp-file cleanup on success/failure (5), routing-decision wiring with real config plumbing (4), and a bug-reproduction anchor that asserts the response never contains data:image / image_url when routing is on (1).
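The bug-reproduction anchor in the last commit boils down to a substring check on the tool result. A rough sketch of that idea follows; `leaks_image_payload` is a hypothetical helper, and both sample payloads are illustrative shapes. Only the `vision_analysis` and `vision_analysis_routed_via` keys are named in this PR; the `elements` list and the legacy envelope layout are assumptions.

```python
import json

# Substrings that must never reach a text-only main model when routing is on.
FORBIDDEN_MARKERS = ("data:image", "image_url")


def leaks_image_payload(response: str) -> bool:
    """Return True if a serialized tool result still carries image content."""
    return any(marker in response for marker in FORBIDDEN_MARKERS)


# Illustrative routed result: text-only JSON (the "elements" shape is an assumption).
routed_response = json.dumps({
    "vision_analysis": "A Save dialog with two buttons.",
    "vision_analysis_routed_via": "auxiliary.vision",
    "elements": [{"index": 1, "role": "button", "label": "Save"}],
})

# Illustrative legacy result: a multimodal envelope (layout is an assumption).
legacy_response = json.dumps({
    "_multimodal": True,
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
})
```

With these samples the routed result passes the check and the legacy envelope fails it, which is the contract the anchor test pins.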

How to Test

Check out this branch and ensure .venv is set up.
Run the fix's full test surface (helper + integration + existing computer_use suite):
```
scripts/run_tests.sh tests/tools/test_computer_use.py \
  tests/tools/test_computer_use_vision_routing.py \
  tests/tools/test_computer_use_capture_routing.py
```
Expected: 85 passed (44 pre-existing in test_computer_use.py + 28 new helper tests + 13 new integration tests).

Bug-reproduction proof — without the production fix, every integration test fails:

$ git checkout upstream/main -- tools/computer_use/tool.py
$ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
============================== 13 failed in 1.29s ==============================

$ git checkout HEAD -- tools/computer_use/tool.py    # restore the fix
$ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
============================== 13 passed in 1.04s ==============================

Optional — manual repro on macOS with the reporter's config:
- Set model.default: tencent/hy3-preview (or any non-vision OpenRouter model) and auxiliary.vision.{provider,model}: openrouter / google/gemini-2.5-flash.
- Trigger computer_use action=capture mode=som.
- Before this PR: HTTP 404 No endpoints found that support image input on the next agent turn.
- After this PR: tool returns a JSON text result with vision_analysis_routed_via: "auxiliary.vision" and the main model receives a description it can act on.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(computer_use): ..., test(computer_use): ...)
I searched for existing PRs to make sure this isn't a duplicate: no PR was mentioned in the timeline of #24015 ("cua-driver ignores auxiliary.vision config, uses main session model for image analysis") at branch creation time
My PR contains only changes related to this fix (no unrelated commits)
I've run the touched test suite and all 85 tests pass; bug-reproduction proof above shows the new 13 integration tests fail without the production fix
I've added tests for my changes (28 unit + 13 integration = 41 new tests)
I've tested on my platform: macOS 15.6 (Darwin 24.6.0), Python 3.12.5

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — N/A; module-level docstrings in the two new files document the policy in detail
I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config keys; reuses existing auxiliary.vision)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — computer_use is macOS-only via cua-driver; the fix path (text temp file under $HERMES_HOME/cache/vision/, Path operations) is portable
I've updated tool descriptions/schemas if I changed tool behavior — N/A; tool result shape is still either _multimodal envelope or JSON string, matching the documented contract

Screenshots / Logs

$ scripts/run_tests.sh tests/tools/test_computer_use.py \
    tests/tools/test_computer_use_vision_routing.py \
    tests/tools/test_computer_use_capture_routing.py
4 workers [85 items]
============================== 85 passed in 1.49s ==============================

$ .venv/bin/ruff check tools/computer_use/ \
    tests/tools/test_computer_use_vision_routing.py \
    tests/tools/test_computer_use_capture_routing.py
All checks passed!

Notes for reviewers

The new helper deliberately mirrors agent.image_routing._explicit_aux_vision_override so the capture path and the user-attached-image path stay in lockstep. If the project ever changes how it detects an explicit auxiliary.vision override, both call sites will need to move together — the lockstep is intentional, not a copy-paste mistake.
The fix is invoked exclusively from _capture_response() (and transitively _maybe_follow_capture() via _capture_response()). The vision and som modes both go through this path; ax mode is unchanged because no PNG is ever produced.
The aux call is intentionally synchronous from the tool's perspective — model_tools._run_async is the project-standard sync→async bridge already used by every other handler that needs to call an async helper from a sync handle_* registration. No new bridge code is introduced.
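For readers unfamiliar with the sync→async bridge pattern mentioned in the last note, here is a generic sketch. It is not the implementation of model_tools._run_async, whose internals are not shown in this PR; `run_sync` and `describe_screenshot` are hypothetical names.

```python
import asyncio
import concurrent.futures


async def describe_screenshot(path: str) -> str:
    """Stand-in for an async vision call (hypothetical)."""
    await asyncio.sleep(0)
    return f"analysis of {path}"


def run_sync(coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here: run the coroutine on a fresh loop
    # in a worker thread and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

A sync handler can then call `run_sync(describe_screenshot("shot.png"))` and get the string back directly. Note that the worker-thread branch blocks the calling thread while it waits, which is acceptable for a tool handler that is synchronous by contract.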

RootMePLS · 2026-05-15T08:02:05Z

It seems that the tests have not passed

Add tools/computer_use/vision_routing.py with should_route_capture_to_aux_vision(provider, model, cfg), a small policy helper that decides whether a captured screenshot should be returned as a multimodal envelope (main model has native vision) or pre-analysed through the auxiliary.vision pipeline so the main model only sees text.

The decision mirrors agent.image_routing.decide_image_input_mode for user-attached images, so the capture path and the user-turn path agree on what counts as an explicit aux vision override:

* provider/model/base_url under auxiliary.vision => explicit override => route through aux vision
* provider+model accepts multimodal tool results AND main model reports supports_vision=True => keep multimodal envelope
* everything else (no tool-result image support, non-vision model, metadata lookup failure) => fail closed and route through aux

No call sites are changed in this commit; the helper is added in isolation so the routing decision can be unit-tested before it is plumbed into _capture_response().

Add tests/tools/test_computer_use_vision_routing.py, 28 unit tests that pin the contract of the new vision-routing helper introduced in the previous commit:

* TestExplicitAuxVisionOverride (12 cases): mirror the auxiliary.vision detection rules used by agent.image_routing so the capture path and the user-attached-image path agree on what counts as an explicit override (provider/model/base_url with non-blank, non-'auto' values).
* TestRouteDecision (7 cases): pin the policy itself. Explicit override always wins, vision-capable + native-tool-result keeps multimodal, everything else fails closed and routes to aux.
* TestLookupHelpers (5 cases): defensive paths for the models.dev / tool-result-support lookups (blank inputs, exceptions, missing caps).
* TestModuleSurface (4 cases): pin the public/__all__ surface and keep internal helpers addressable so the integration test in the next commit can monkeypatch them deterministically.

Run with: scripts/run_tests.sh tests/tools/test_computer_use_vision_routing.py
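The explicit-override rule that these tests pin (provider/model/base_url with non-blank, non-'auto' values) could look roughly like this. A sketch only: `has_explicit_aux_vision` is a hypothetical name, and treating 'auto' case-insensitively is an assumption, not something stated in the PR.

```python
from typing import Any

OVERRIDE_KEYS = ("provider", "model", "base_url")


def has_explicit_aux_vision(cfg: Any) -> bool:
    """Return True when auxiliary.vision names a concrete provider, model, or base_url."""
    aux = cfg.get("auxiliary") if isinstance(cfg, dict) else None
    vision = aux.get("vision") if isinstance(aux, dict) else None
    if not isinstance(vision, dict):
        return False   # missing or malformed block: no override
    for key in OVERRIDE_KEYS:
        value = vision.get(key)
        # Blank strings and 'auto' mean "let the router decide", not an override.
        # Case-insensitive matching of 'auto' is an assumption of this sketch.
        if isinstance(value, str) and value.strip() and value.strip().lower() != "auto":
            return True
    return False
```

The reporter's config (provider: openrouter, model: google/gemini-2.5-flash) counts as an override under this rule, while an empty block or `provider: auto` does not.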

…usResearch#24015)

When the active main model has no vision capability, or when the user explicitly configured auxiliary.vision in config.yaml, sending the captured screenshot back to the main model in a multimodal tool-result envelope is the wrong move: it trips HTTP 404 / 400 at the provider boundary (e.g. 'No endpoints found that support image input') and the agent loop reports a hard tool failure for what should have been a simple capture.

The reporter on NousResearch#24015 hit this with:

    model:
      default: tencent/hy3-preview      # no vision support
      provider: openrouter
    auxiliary:
      vision:
        provider: openrouter
        model: google/gemini-2.5-flash  # explicitly configured

…and observed:

    computer_use(action='capture', mode='som')
    → ⚠️ API call failed (attempt1/3): NotFoundError [HTTP 404]
       🔌 Provider: openrouter  Model: tencent/hy3-preview
       📝 Error: HTTP 404: No endpoints found that support image input

Fix: in tools/computer_use/tool.py::_capture_response, after a screenshot is captured (modes 'som' / 'vision'), consult the routing helper introduced earlier in this branch. When it says 'route to aux', materialise the PNG to $HERMES_HOME/cache/vision/, run vision_analyze on it (which honours auxiliary.vision via the standard async_call_llm task='vision' router), and return a text-only JSON tool result that embeds the analysis alongside the existing AX/SOM index. The main model never sees the pixels; it sees an actionable text description plus the same set-of-mark element index it normally uses.

The two new helpers (_should_route_through_aux_vision, _route_capture_through_aux_vision) keep the policy and the IO separated so each can be tested in isolation. Both fail open: if the config import fails, if the aux call raises, or if the analysis is empty, we fall back to the existing multimodal envelope so the behaviour is at worst the pre-fix status quo.

Temp screenshot files are cleaned up unconditionally in a finally block, even on aux call failure, to avoid leaving residue under cache/vision/.

The end-to-end regression for NousResearch#24015 is added in the next commit.

…search#24015)

Add tests/tools/test_computer_use_capture_routing.py, 13 integration tests that drive _capture_response end-to-end with deterministic stubs for the routing helper, _run_async, vision_analyze_tool, and get_hermes_dir, so the full code path is exercised without a live cua-driver, real auxiliary client, or network access.

Coverage:

* TestCaptureResponseDefaultPath (3 cases)
  - SOM PNG capture returns the legacy multimodal envelope when the routing helper says 'native' (image/png MIME).
  - Same path returns image/jpeg MIME for JPEG payloads (cua-driver can return either).
  - AX-only mode never even consults the routing helper because no PNG is present.
* TestCaptureResponseRoutedToAuxVision (5 cases)
  - SOM capture with routing on returns a JSON string with the vision_analysis embedded, the AX/SOM index preserved, and NO image_url parts. Verifies the aux call receives a path under the configured cache and a prompt that grounds itself against the AX summary.
  - Temp screenshot file is unlinked after _capture_response returns, including when the aux call raises (the finally block runs).
  - Empty / malformed aux analysis falls back to the multimodal envelope so the user always gets *something* useful.
* TestRoutingDecisionWiring (4 cases)
  - Explicit auxiliary.vision in config flips routing on regardless of main-model vision capability.
  - Vision-capable main + native tool-result support keeps multimodal.
  - Config load failure fails open (returns False, multimodal path continues to work).
  - Helper exception is swallowed and routes to legacy behaviour.
* TestBugReproductionAnchor (1 case)
  - Directly pins the NousResearch#24015 contract: when routing is on, the response must NEVER contain a 'data:image' or 'image_url' substring. That is exactly what tripped the reporter's HTTP 404 ('No endpoints found that support image input') on tencent/hy3-preview before the fix.

Bug-reproduction proof:

    $ git checkout upstream/main -- tools/computer_use/tool.py
    $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
    ============================== 13 failed in 1.29s ==============================

    $ # restore tool.py to this branch's HEAD
    $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
    ============================== 13 passed in 1.04s ==============================

Total branch coverage: 85 passed across test_computer_use.py, test_computer_use_vision_routing.py, test_computer_use_capture_routing.py
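The default-path tests above distinguish image/png from image/jpeg payloads. How tool.py actually picks the MIME type is not shown in this PR; one common approach, sketched here as an assumption, is to sniff the file signature of the decoded payload. `sniff_image_mime` and `to_data_url` are hypothetical names.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG file signature
JPEG_PREFIX = b"\xff\xd8\xff"          # JPEG start-of-image marker plus next marker byte


def sniff_image_mime(image_b64: str) -> str:
    """Guess the MIME type of a base64-encoded screenshot from its leading bytes."""
    raw = base64.b64decode(image_b64)
    if raw.startswith(PNG_SIGNATURE):
        return "image/png"
    if raw.startswith(JPEG_PREFIX):
        return "image/jpeg"
    return "application/octet-stream"   # unknown payload: do not guess


def to_data_url(image_b64: str) -> str:
    """Build the data: URL that a multimodal envelope would carry on the native path."""
    return f"data:{sniff_image_mime(image_b64)};base64,{image_b64}"
```

On the routed path no such URL is built at all, which is why the anchor test can assert that 'data:image' never appears in the response.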

xxxigm · 2026-05-15T17:06:20Z

@RootMePLS Thanks for flagging! The earlier failures were stale-base issues. I've now rebased the branch onto current upstream/main (HEAD 9fb40e6a3) and force-pushed — clean rebase, no conflicts, all 4 fix-test alternation commits preserved.

After rebase, all checks that depend on this branch's code pass:

| Check | Before rebase | After rebase |
| --- | --- | --- |
| `Windows footguns (blocking)` | ❌ FAILURE (stale base, `tools/process_registry.py:588`; branch never touched this file) | ✅ SUCCESS |
| `ruff enforcement (blocking)` | ✅ | ✅ |
| `ruff + ty diff` | ✅ | ✅ |
| `Check PyPI dependency upper bounds` | n/a | ✅ |
| `nix (ubuntu-latest)` | ✅ | ✅ |
| `nix (macos-latest)` | ✅ | ✅ |
| `build-arm64` | ✅ | ✅ |
| Supply Chain Audit | ✅ | ✅ |

The remaining failures are all pre-existing on upstream/main, not caused by this branch:

test job — 3 failures in tests/run_agent/test_provider_parity.py (Nous Portal context-length validation):

FAILED TestDeveloperRoleSwap::test_developer_role_via_nous_portal
FAILED TestBuildApiKwargsNousPortal::test_includes_nous_product_tags
FAILED TestBuildApiKwargsNousPortal::test_uses_chat_completions_format
→ ValueError: Model has a context window of 15,000 tokens, below minimum 64,000

Same 3 failures on upstream/main HEAD 9fb40e6a3 (run 25923943151 / job 76200003383). Nothing to do with tools/computer_use/.

e2e job — 4 failures in tests/e2e/test_discord_adapter.py (Discord mock SimpleNamespace missing history, discord.Forbidden not a real exception class in tests). Same 4 failures on upstream/main HEAD 9fb40e6a3. Introduced by recent Discord history-backfill PRs (#4abfb6bc2, #e84fe483b), unrelated to capture routing.
build-amd64 — Docker runner disk space: no space left on device writing lightningcss-linux-x64-gnu.node. Infra issue — build-arm64 (same commit) succeeded.

Bug-fix tests for this PR

The fix's own test suite passes locally and in CI lint:

$ scripts/run_tests.sh tests/tools/test_computer_use.py \
    tests/tools/test_computer_use_vision_routing.py \
    tests/tools/test_computer_use_capture_routing.py
============================== 88 passed in 1.43s ==============================

47 pre-existing test_computer_use.py tests — still green
28 new helper tests (test_computer_use_vision_routing.py)
13 new integration tests (test_computer_use_capture_routing.py)

Bug-reproduction proof still holds: revert tools/computer_use/tool.py to upstream/main and all 13 integration tests fail with assert "data:image" not in resp etc.

Happy to address any code feedback — the failing checks are all environmental / pre-existing, not signal on the fix itself.

RootMePLS · 2026-05-18T07:36:32Z

@xxxigm thank you for working on this!
It seems not all tests passed.

teknium1 · 2026-05-22T00:38:28Z

Salvaged via PR #30126 (commit bec2250 on main). Your 4 commits were cherry-picked onto current main with your authorship preserved. Thanks for the thorough fix — 41 new tests + bug-repro anchor + fail-open semantics made this a clean review. Closes #24015 and #29407 (dup).

alt-glitch added labels on May 12, 2026: type/bug (Something isn't working), comp/tools (Tool registry, model_tools, toolsets), tool/vision (Vision analysis and image generation), P2 (Medium; degraded but workaround exists)

helix4u mentioned this pull request May 14, 2026: fix(agent): keep image tool results from poisoning text-only sessions #25903 (Closed, 19 tasks)

teknium1 mentioned this pull request May 14, 2026: fix(agent): keep image tool results from poisoning text-only sessions #25925 (Merged)

xxxigm added 4 commits May 15, 2026 23:44

xxxigm force-pushed the fix/computer-use-aux-vision-routing branch from bfb44ff to 9365234 May 15, 2026 16:45

alt-glitch mentioned this pull request May 16, 2026: fix(vision): route image analysis through active thread model #27015 (Closed, 23 tasks)

alt-glitch mentioned this pull request May 20, 2026: [Feature] computer_use: route screenshots through auxiliary.vision when main model lacks vision #29407 (Closed)

This was referenced May 21, 2026:
- fix(computer_use): two bugs blocking cua-driver integration #24232 (Closed)
- fix(computer_use): route SOM/vision captures via auxiliary.vision (#24015) #30126 (Merged)

teknium1 closed this in #30126 May 22, 2026

xxxigm wants to merge 4 commits into NousResearch:main from xxxigm:fix/computer-use-aux-vision-routing

4 participants
