Fix/computer use aux vision routing#24070

Closed
xxxigm wants to merge 4 commits into
NousResearch:mainfrom
xxxigm:fix/computer-use-aux-vision-routing

xxxigm commented May 12, 2026

Contributor

What does this PR do?

Fixes #24015. The computer_use tool's capture action (mode='som' / mode='vision') used to always return a _multimodal envelope containing the screenshot, which was then delivered to the active main session model as the tool result. When the active main model has no vision capability — or when the user explicitly configured auxiliary.vision in config.yaml — that envelope tripped HTTP 404 / 400 at the provider boundary (e.g. No endpoints found that support image input) and the agent loop reported a hard tool failure.

Reporter's repro:

model:
  default: tencent/hy3-preview      # no vision support
  provider: openrouter
auxiliary:
  vision:
    provider: openrouter
    model: google/gemini-2.5-flash  # explicitly configured, never used
computer_use(action='capture', mode='som')
→ ⚠️ API call failed (attempt1/3): NotFoundError [HTTP 404]
   🔌 Provider: openrouter  Model: tencent/hy3-preview
   📝 Error: HTTP 404: No endpoints found that support image input

This PR applies the routing policy that vision_analyze already uses for user-attached images. The capture is routed through auxiliary vision when any of three conditions holds: an explicit auxiliary.vision block is set, the active main model and provider cannot carry images inside tool-result messages, or the main model reports no vision capability. In that case the captured PNG is materialised under $HERMES_HOME/cache/vision/ and handed to vision_analyze_tool, which honours auxiliary.vision via the standard async_call_llm(task='vision', ...) router. The result is returned as a text-only JSON tool message that embeds the analysis alongside the existing AX/SOM index. The main model never sees the pixels; it sees an actionable text description plus the same set-of-mark element index it normally uses.
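The three-way policy can be sketched as a pure function. This is a minimal illustration rather than the PR's code: the name route_to_aux and its three boolean parameters are hypothetical stand-ins for the lookups that should_route_capture_to_aux_vision(provider, model, cfg) performs internally.

```python
def route_to_aux(explicit_override: bool,
                 tool_results_accept_images: bool,
                 main_supports_vision: bool) -> bool:
    """Return True when a capture should be pre-analysed by auxiliary vision.

    Hypothetical sketch: each argument stands in for a lookup the real
    helper performs against config and model metadata.
    """
    if explicit_override:
        # An explicit auxiliary.vision block always wins.
        return True
    if not tool_results_accept_images:
        # The provider cannot carry images inside tool-result messages.
        return True
    if not main_supports_vision:
        # The main model reports no vision capability.
        return True
    # Only a vision-capable model on an image-capable provider keeps
    # the native multimodal envelope.
    return False
```

Only one of the eight input combinations keeps the multimodal envelope, which makes the policy straightforward to pin with a small truth-table test.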

The decision deliberately fails open: any failure (config import, helper exception, aux LLM crash, empty analysis) falls back to the legacy multimodal envelope. The temp screenshot file is unlinked unconditionally in a finally block.
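The fail-open behaviour and the unconditional cleanup can be sketched as follows. This is an illustration under assumptions, not the code in tools/computer_use/tool.py: analyse_capture_or_none, the analyse callback, the cache_dir argument, and the "summary" key are hypothetical; only the vision_analysis and vision_analysis_routed_via field names come from the PR text.

```python
import base64
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def analyse_capture_or_none(png_b64: str,
                            ax_summary: str,
                            analyse: Callable[[str, str], str],
                            cache_dir: Path) -> Optional[str]:
    """Return a text-only JSON result, or None to signal 'use the legacy envelope'."""
    tmp_path: Optional[Path] = None
    try:
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Materialise the screenshot so the analyser can read it from disk.
        with tempfile.NamedTemporaryFile(
            dir=cache_dir, suffix=".png", delete=False
        ) as fh:
            fh.write(base64.b64decode(png_b64))
            tmp_path = Path(fh.name)
        analysis = analyse(str(tmp_path), ax_summary)
        if not analysis or not analysis.strip():
            return None  # empty analysis: fail open
        return json.dumps({
            "summary": ax_summary,  # hypothetical key for the AX/SOM text
            "vision_analysis": analysis,
            "vision_analysis_routed_via": "auxiliary.vision",
        })
    except Exception:
        return None  # any failure: fall back to the multimodal envelope
    finally:
        # Runs on success and on failure, so no screenshot is left behind.
        if tmp_path is not None:
            tmp_path.unlink(missing_ok=True)
```

Returning None instead of raising keeps the caller simple: it can treat None as "continue with the pre-fix multimodal path", so the worst case matches the old behaviour.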

Related Issue

Fixes #24015

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

Four commits, fix-test alternation, +992 lines, −0 lines:

  1. fix(computer_use): add helper to decide capture vision routing: new module tools/computer_use/vision_routing.py (+152 lines) with should_route_capture_to_aux_vision(provider, model, cfg) plus the small lookup helpers it composes. Mirrors agent.image_routing.decide_image_input_mode so capture-routing and user-attached-image routing agree on what counts as an explicit aux override.
  2. test(computer_use): cover capture vision-routing helper: tests/tools/test_computer_use_vision_routing.py (+260 lines, 28 unit tests) pinning the helper's contract: explicit override detection (12 cases), policy decision (7 cases), defensive lookups (5 cases), public surface (4 cases).
  3. fix(computer_use): route SOM/vision captures via auxiliary.vision (#24015): tools/computer_use/tool.py (+149 lines) adds _should_route_through_aux_vision() (reads main provider/model + config, asks the helper, fails open) and _route_capture_through_aux_vision(cap, summary) (decodes the base64 PNG to $HERMES_HOME/cache/vision/, runs vision_analyze_tool via model_tools._run_async, returns a JSON text response). Wires both into _capture_response() ahead of the existing multimodal envelope branch.
  4. test(computer_use): end-to-end regression for capture routing (#24015): tests/tools/test_computer_use_capture_routing.py (+431 lines, 13 integration tests) driving _capture_response end-to-end with deterministic stubs: default native path (3), routed-to-aux path including temp-file cleanup on success/failure (5), routing-decision wiring with real config plumbing (4), and a bug-reproduction anchor that asserts the response never contains data:image / image_url when routing is on (1).

How to Test

  1. Check out this branch and ensure .venv is set up.

  2. Run the fix's full test surface (helper + integration + existing computer_use suite):

    scripts/run_tests.sh tests/tools/test_computer_use.py \
      tests/tools/test_computer_use_vision_routing.py \
      tests/tools/test_computer_use_capture_routing.py
    

    Expected: 85 passed (44 pre-existing in test_computer_use.py + 28 new helper tests + 13 new integration tests).

  3. Bug-reproduction proof — without the production fix, every integration test fails:

    $ git checkout upstream/main -- tools/computer_use/tool.py
    $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
    ============================== 13 failed in 1.29s ==============================
    
    $ git checkout HEAD -- tools/computer_use/tool.py    # restore the fix
    $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
    ============================== 13 passed in 1.04s ==============================
    
  4. Optional — manual repro on macOS with the reporter's config:

    • Set model.default: tencent/hy3-preview (or any non-vision OpenRouter model) and auxiliary.vision.{provider,model}: openrouter / google/gemini-2.5-flash.
    • Trigger computer_use action=capture mode=som.
    • Before this PR: HTTP 404 No endpoints found that support image input on the next agent turn.
    • After this PR: tool returns a JSON text result with vision_analysis_routed_via: "auxiliary.vision" and the main model receives a description it can act on.

Checklist

Code

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A; module-level docstrings in the two new files document the policy in detail
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config keys; reuses existing auxiliary.vision)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide: computer_use is macOS-only via cua-driver; the fix path (text temp file under $HERMES_HOME/cache/vision/, Path operations) is portable
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A; tool result shape is still either _multimodal envelope or JSON string, matching the documented contract

Screenshots / Logs

$ scripts/run_tests.sh tests/tools/test_computer_use.py \
    tests/tools/test_computer_use_vision_routing.py \
    tests/tools/test_computer_use_capture_routing.py
4 workers [85 items]
============================== 85 passed in 1.49s ==============================

$ .venv/bin/ruff check tools/computer_use/ \
    tests/tools/test_computer_use_vision_routing.py \
    tests/tools/test_computer_use_capture_routing.py
All checks passed!

Notes for reviewers

  • The new helper deliberately mirrors agent.image_routing._explicit_aux_vision_override so the capture path and the user-attached-image path stay in lockstep. If the project ever changes how it detects an explicit auxiliary.vision override, both call sites will need to move together — the lockstep is intentional, not a copy-paste mistake.
  • The fix is invoked exclusively from _capture_response() (and transitively _maybe_follow_capture() via _capture_response()). The vision and som modes both go through this path; ax mode is unchanged because no PNG is ever produced.
  • The aux call is intentionally synchronous from the tool's perspective — model_tools._run_async is the project-standard sync→async bridge already used by every other handler that needs to call an async helper from a sync handle_* registration. No new bridge code is introduced.
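For readers unfamiliar with the sync-to-async bridging pattern mentioned in the last note, here is a generic sketch. It is not the implementation of model_tools._run_async, which is not shown in this PR; run_sync is a hypothetical name and the thread-based fallback is one common approach among several.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine


def run_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here. Run the coroutine on a worker thread
    # with its own loop instead of nesting event loops.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def describe() -> str:
    # Stand-in for an async vision call.
    await asyncio.sleep(0)
    return "a settings window"


result = run_sync(describe())
```

A synchronous tool handler can then call an async helper through a bridge of this kind without becoming a coroutine itself.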

@alt-glitch added labels on May 12, 2026: type/bug (Something isn't working), comp/tools (Tool registry, model_tools, toolsets), tool/vision (Vision analysis and image generation), P2 (Medium, degraded but workaround exists)
@RootMePLS


It seems that the tests have not passed.

xxxigm added 4 commits May 15, 2026 23:44
Add tools/computer_use/vision_routing.py with
should_route_capture_to_aux_vision(provider, model, cfg) — a small
policy helper that decides whether a captured screenshot should be
returned as a multimodal envelope (main model has native vision) or
pre-analysed through the auxiliary.vision pipeline so the main model
only sees text.

The decision mirrors agent.image_routing.decide_image_input_mode for
user-attached images, so the capture path and the user-turn path agree
on what counts as an explicit aux vision override:
  * provider/model/base_url under auxiliary.vision => explicit override
    => route through aux vision
  * provider+model accepts multimodal tool results AND main model
    reports supports_vision=True => keep multimodal envelope
  * everything else (no tool-result image support, non-vision model,
    metadata lookup failure) => fail closed and route through aux

No call sites are changed in this commit; the helper is added in
isolation so the routing decision can be unit-tested before it is
plumbed into _capture_response().

Add tests/tools/test_computer_use_vision_routing.py — 28 unit tests
that pin the contract of the new vision-routing helper introduced in
the previous commit:

  * TestExplicitAuxVisionOverride (12 cases): mirror the
    auxiliary.vision detection rules used by agent.image_routing so
    the capture path and the user-attached-image path agree on what
    counts as an explicit override (provider/model/base_url with
    non-blank, non-'auto' values).
  * TestRouteDecision (7 cases): pin the policy itself — explicit
    override always wins, vision-capable + native-tool-result keeps
    multimodal, everything else fails closed and routes to aux.
  * TestLookupHelpers (5 cases): defensive paths for the models.dev /
    tool-result-support lookups (blank inputs, exceptions, missing
    caps).
  * TestModuleSurface (4 cases): pin the public/__all__ surface and
    keep internal helpers addressable so the integration test in the
    next commit can monkeypatch them deterministically.
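The explicit-override rule these tests pin (provider/model/base_url with non-blank, non-'auto' values) might look roughly like the sketch below. The function name has_explicit_aux_vision is hypothetical, and treating 'auto' case-insensitively is an assumption not confirmed by the commit message.

```python
from typing import Any, Mapping


def has_explicit_aux_vision(cfg: Mapping[str, Any]) -> bool:
    """Return True when auxiliary.vision names a concrete provider, model, or base_url."""
    aux = cfg.get("auxiliary")
    vision = aux.get("vision") if isinstance(aux, Mapping) else None
    if not isinstance(vision, Mapping):
        # Missing or malformed block: no explicit override.
        return False
    for key in ("provider", "model", "base_url"):
        value = vision.get(key)
        # Blank strings and 'auto' mean "let the router decide".
        # Assumption: 'auto' is matched case-insensitively.
        if isinstance(value, str) and value.strip() and value.strip().lower() != "auto":
            return True
    return False
```

With the reporter's config (provider: openrouter, model: google/gemini-2.5-flash) a check like this returns True, so the capture would be routed through auxiliary vision regardless of the main model's capabilities.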

Run with:
  scripts/run_tests.sh tests/tools/test_computer_use_vision_routing.py

…usResearch#24015)

When the active main model has no vision capability — or when the user
explicitly configured auxiliary.vision in config.yaml — sending the
captured screenshot back to the main model in a multimodal tool-result
envelope is the wrong move: it trips HTTP 404 / 400 at the provider
boundary (e.g. 'No endpoints found that support image input') and the
agent loop reports a hard tool failure for what should have been a
simple capture.

The reporter on NousResearch#24015 hit this with:

  model:
    default: tencent/hy3-preview      # no vision support
    provider: openrouter
  auxiliary:
    vision:
      provider: openrouter
      model: google/gemini-2.5-flash  # explicitly configured

…and observed:

  computer_use(action='capture', mode='som')
  → ⚠️ API call failed (attempt1/3): NotFoundError [HTTP 404]
     🔌 Provider: openrouter  Model: tencent/hy3-preview
     📝 Error: HTTP 404: No endpoints found that support image input

Fix: in tools/computer_use/tool.py::_capture_response, after a
screenshot is captured (modes 'som' / 'vision'), consult the routing
helper introduced earlier in this branch. When it says 'route to aux',
materialise the PNG to $HERMES_HOME/cache/vision/, run vision_analyze
on it (which honours auxiliary.vision via the standard async_call_llm
task='vision' router), and return a text-only JSON tool result that
embeds the analysis alongside the existing AX/SOM index. The main
model never sees the pixels — it sees an actionable text description
plus the same set-of-mark element index it normally uses.

The two new helpers (_should_route_through_aux_vision,
_route_capture_through_aux_vision) keep the policy and the IO
separated so each can be tested in isolation. Both fail open: if the
config import fails, if the aux call raises, or if the analysis is
empty, we fall back to the existing multimodal envelope so the
behaviour is at worst the pre-fix status quo. Temp screenshot files
are cleaned up unconditionally in a finally block — even on aux call
failure — to avoid leaving residue under cache/vision/.

The end-to-end regression for NousResearch#24015 is added in the next commit.

…search#24015)

Add tests/tools/test_computer_use_capture_routing.py — 13 integration
tests that drive _capture_response end-to-end with deterministic stubs
for the routing helper, _run_async, vision_analyze_tool, and
get_hermes_dir, so the full code path is exercised without a live
cua-driver, real auxiliary client, or network access.

Coverage:

  * TestCaptureResponseDefaultPath (3 cases)
    - SOM PNG capture returns the legacy multimodal envelope when the
      routing helper says 'native' (image/png MIME).
    - Same path returns image/jpeg MIME for JPEG payloads (cua-driver
      can return either).
    - AX-only mode never even consults the routing helper because no
      PNG is present.

  * TestCaptureResponseRoutedToAuxVision (5 cases)
    - SOM capture with routing on returns a JSON string with the
      vision_analysis embedded, the AX/SOM index preserved, and NO
      image_url parts. Verifies the aux call receives a path under
      the configured cache and a prompt that grounds itself against
      the AX summary.
    - Temp screenshot file is unlinked after _capture_response returns,
      including when the aux call raises (the finally block runs).
    - Empty / malformed aux analysis falls back to the multimodal
      envelope so the user always gets *something* useful.

  * TestRoutingDecisionWiring (4 cases)
    - Explicit auxiliary.vision in config flips routing on regardless of
      main-model vision capability.
    - Vision-capable main + native tool-result support keeps multimodal.
    - Config load failure fails open (returns False, multimodal path
      continues to work).
    - Helper exception is swallowed and routes to legacy behaviour.

  * TestBugReproductionAnchor (1 case) - directly pins the NousResearch#24015
    contract: when routing is on, the response must NEVER contain a
    'data:image' or 'image_url' substring. That is exactly what tripped
    the reporter's HTTP 404 ('No endpoints found that support image
    input') on tencent/hy3-preview before the fix.
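The bug-reproduction anchor boils down to a substring check on the tool result. The sketch below shows the shape of such an assertion; assert_text_only and the sample payload are hypothetical, and the real test drives _capture_response with stubs rather than building the JSON by hand.

```python
import json


def assert_text_only(response: str) -> None:
    """Fail if a routed tool result still carries an image payload."""
    assert "data:image" not in response
    assert "image_url" not in response


# Hypothetical routed result: plain text analysis, no pixels.
routed = json.dumps({
    "vision_analysis": "A settings window with two buttons.",
    "vision_analysis_routed_via": "auxiliary.vision",
})
assert_text_only(routed)
```

A substring check is deliberately blunt: it catches an image leaking through any field of the response, not only the one the fix happens to populate.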

Bug-reproduction proof:
  $ git checkout upstream/main -- tools/computer_use/tool.py
  $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
  ============================== 13 failed in 1.29s ==============================

  $ # restore tool.py to this branch's HEAD
  $ scripts/run_tests.sh tests/tools/test_computer_use_capture_routing.py
  ============================== 13 passed in 1.04s ==============================

Total branch coverage:
  85 passed across test_computer_use.py, test_computer_use_vision_routing.py,
  test_computer_use_capture_routing.py
@xxxigm xxxigm force-pushed the fix/computer-use-aux-vision-routing branch from bfb44ff to 9365234 Compare May 15, 2026 16:45
xxxigm commented May 15, 2026

Contributor Author

@RootMePLS Thanks for flagging! The earlier failures were stale-base issues. I've now rebased the branch onto current upstream/main (HEAD 9fb40e6a3) and force-pushed — clean rebase, no conflicts, all 4 fix-test alternation commits preserved.

After rebase, all checks that depend on this branch's code pass:

Check | Before rebase | After rebase
Windows footguns (blocking) | ❌ FAILURE (stale base, tools/process_registry.py:588; branch never touched this file) | ✅ SUCCESS
ruff enforcement (blocking)
ruff + ty diff
Check PyPI dependency upper bounds n/a
nix (ubuntu-latest)
nix (macos-latest)
build-arm64
Supply Chain Audit

The remaining failures are all pre-existing on upstream/main, not caused by this branch:

  1. test job — 3 failures in tests/run_agent/test_provider_parity.py (Nous Portal context-length validation):

    FAILED TestDeveloperRoleSwap::test_developer_role_via_nous_portal
    FAILED TestBuildApiKwargsNousPortal::test_includes_nous_product_tags
    FAILED TestBuildApiKwargsNousPortal::test_uses_chat_completions_format
    → ValueError: Model has a context window of 15,000 tokens, below minimum 64,000
    

    Same 3 failures on upstream/main HEAD 9fb40e6a3 (run 25923943151 / job 76200003383). Nothing to do with tools/computer_use/.

  2. e2e job — 4 failures in tests/e2e/test_discord_adapter.py (Discord mock SimpleNamespace missing history, discord.Forbidden not a real exception class in tests). Same 4 failures on upstream/main HEAD 9fb40e6a3. Introduced by recent Discord history-backfill PRs (#4abfb6bc2, #e84fe483b), unrelated to capture routing.

  3. build-amd64 — Docker runner disk space: no space left on device writing lightningcss-linux-x64-gnu.node. Infra issue — build-arm64 (same commit) succeeded.

Bug-fix tests for this PR

The fix's own test suite passes locally and in CI lint:

$ scripts/run_tests.sh tests/tools/test_computer_use.py \
    tests/tools/test_computer_use_vision_routing.py \
    tests/tools/test_computer_use_capture_routing.py
============================== 88 passed in 1.43s ==============================
  • 47 pre-existing test_computer_use.py tests — still green
  • 28 new helper tests (test_computer_use_vision_routing.py)
  • 13 new integration tests (test_computer_use_capture_routing.py)

Bug-reproduction proof still holds: revert tools/computer_use/tool.py to upstream/main and all 13 integration tests fail with assert "data:image" not in resp etc.

Happy to address any code feedback — the failing checks are all environmental / pre-existing, not signal on the fix itself.

RootMePLS commented May 18, 2026


@xxxigm thank you for working on this!
It seems not all tests passed.

@teknium1

Contributor

Salvaged via PR #30126 (commit bec2250 on main). Your 4 commits were cherry-picked onto current main with your authorship preserved. Thanks for the thorough fix — 41 new tests + bug-repro anchor + fail-open semantics made this a clean review. Closes #24015 and #29407 (dup).

Labels

  • comp/tools (Tool registry, model_tools, toolsets)
  • P2 (Medium, degraded but workaround exists)
  • tool/vision (Vision analysis and image generation)
  • type/bug (Something isn't working)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cua-driver ignores auxiliary.vision config, uses main session model for image analysis

4 participants