feat(agent): single-knob native vision for custom-provider models #29679
Merged
Custom/local provider models absent from models.dev get classified as
non-vision and have their image content stripped before reaching the
upstream API. Surface a user-facing override:
    model:
      supports_vision: true

    providers:
      my-vllm:
        models:
          my-llava:
            supports_vision: true
The override short-circuits the models.dev lookup in
_model_supports_vision(), which is the single gate guarding image-strip
preprocessing on every transport path.
Refs #8731.
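To make the role of the gate concrete, here is a minimal sketch of an image-strip step that is skipped when the model is declared vision-capable. The function name `strip_image_parts` and the OpenAI-style `image_url` content-part shape are assumptions for illustration; the actual preprocessing in Hermes may differ.

```python
from typing import Any

Message = dict[str, Any]


def strip_image_parts(messages: list[Message], supports_vision: bool) -> list[Message]:
    """Hypothetical strip step: drop image parts unless the model can see them."""
    if supports_vision:
        # Gate open: messages pass through untouched, images included.
        return messages
    cleaned: list[Message] = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            # Keep every part except image_url parts (assumed part shape).
            kept = [
                part for part in content
                if not (isinstance(part, dict) and part.get("type") == "image_url")
            ]
            msg = {**msg, "content": kept}  # copy, so the caller's message is not mutated
        cleaned.append(msg)
    return cleaned
```

With the override set, the gate returns the message list unchanged, so the upstream API receives the original image parts.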
Named custom providers are rewritten to provider="custom" at runtime (hermes_cli/runtime_provider.py:_resolve_named_custom_runtime), so a config under providers.my-vllm.models.my-llava.supports_vision was unreachable via self.provider alone. Also try cfg.model.provider as a candidate provider key, covering both runtime and config naming. Adds a regression test for the named-provider path.
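The candidate-key lookup described above might look roughly like the following. This is a sketch under assumptions: the config is taken to be a plain nested dict, the function name `supports_vision_override` is hypothetical, and only real booleans are accepted to keep the example short.

```python
from typing import Any, Optional


def supports_vision_override(
    cfg: dict[str, Any], runtime_provider: str, model: str
) -> Optional[bool]:
    """Return the configured override, or None to defer to the models.dev lookup."""
    model_cfg = cfg.get("model") or {}

    # 1. Top-level shortcut: model.supports_vision
    top = model_cfg.get("supports_vision")
    if isinstance(top, bool):
        return top

    # 2. Per-model entry under a named provider. The runtime name may have been
    #    rewritten to "custom", so the configured provider name is tried as well.
    providers = cfg.get("providers") or {}
    for key in (runtime_provider, model_cfg.get("provider")):
        if not key:
            continue
        models = (providers.get(key) or {}).get("models") or {}
        value = (models.get(model) or {}).get("supports_vision")
        if isinstance(value, bool):
            return value

    # 3. No override found: the caller falls back to models.dev.
    return None
```

For example, with `model.provider: my-vllm` in the config and a runtime provider of `"custom"`, the first candidate key misses and the second one finds `providers.my-vllm.models.my-llava.supports_vision`.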
The contributor PR (#17936) only patched the strip path in `_model_supports_vision()`. The auto-mode router in `agent/image_routing._lookup_supports_vision` still only read models.dev, so a custom-provider model declared as vision-capable would still get its images routed through vision_analyze in the default `agent.image_input_mode: auto` setting. Users had to set both `supports_vision: true` AND `image_input_mode: native` to bypass the text pipeline.

Single-knob behavior now: `supports_vision: true` alone is enough in auto mode. The strip path and the routing path consult the same resolver.

- Extract override resolution into `_supports_vision_override()` in agent/image_routing.py and wire it into `_lookup_supports_vision()`.
- Refactor `run_agent._model_supports_vision` to call the same helper (DRY, single source of truth for the resolution order).
- Strict YAML boolean coercion: `supports_vision: "false"` (quoted, a common YAML mistake) no longer coerces to True via bool() truthiness. Recognised tokens: true/false/yes/no/on/off/1/0 plus real bools and 0/1. Unrecognised values return None and fall through to models.dev.
- Add @CNSeniorious000 to AUTHOR_MAP for release attribution.

Tests: 26 new (TestCoerceCapabilityBool, TestSupportsVisionOverride, TestLookupSupportsVisionOverride, TestAutoModeRespectsOverride). Existing contributor tests + image_routing + vision_native_fast_path + native_image_buffer_isolation all green (92/92).
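The strict coercion rule can be illustrated with a short sketch. The function name `coerce_capability_bool` is hypothetical (borrowed loosely from the test class name), and the case-insensitive, whitespace-tolerant matching is an assumption; the commit message only lists the recognised tokens.

```python
from typing import Any, Optional

_TRUE_TOKENS = {"true", "yes", "on", "1"}
_FALSE_TOKENS = {"false", "no", "off", "0"}


def coerce_capability_bool(value: Any) -> Optional[bool]:
    """Map a config value to True/False, or None if it is not recognised."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int) and value in (0, 1):
        return bool(value)
    if isinstance(value, str):
        token = value.strip().lower()  # assumption: tolerate case and padding
        if token in _TRUE_TOKENS:
            return True
        if token in _FALSE_TOKENS:
            return False
    # Anything else (None, 2, "maybe", lists, ...) is left for the fallback lookup.
    return None
```

The point of returning `None` instead of calling `bool()` is that any non-empty string, including `"false"`, is truthy in Python, so plain truthiness would silently enable vision for a quoted `"false"`.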
The interactive CLI input path consults decide_image_input_mode() to pick between native image_url attachment and the vision_analyze text pipeline, but the non-interactive 'hermes chat -Q -q ... --image FOO' path unconditionally called _preprocess_images_with_vision(), so even with `model.supports_vision: true` set, --image always went through the text pipeline. Symptom: vision_analyze runs 4-5s per image and the model sees a lossy text summary instead of the actual pixels.

Mirror the interactive path: load config, call decide_image_input_mode, branch on native vs text. Falls back to the text pipeline on any import or build error (Pyright-clean: _build_parts guarded with `is not None`).

Live E2E (provider=custom, base_url=openrouter, anthropic/claude-haiku-4.5, red 64x64 PNG):

    baseline (no override): vision_analyze called (8 log lines), 5.8s
    with supports_vision:   vision_analyze NOT called (0 log lines), 3.9s

Same model, same image, single knob flips text→native routing.
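The native-versus-text branch with a text-pipeline fallback could be sketched as follows. Everything here is illustrative: `build_user_content`, the injected `decide_mode` and `describe_image` callables, the `"native"` return value, and the description format are assumptions, not the actual code in cli.py. The `image_url` part uses the OpenAI-style data-URL shape.

```python
import base64
from typing import Any, Callable, Union

Content = Union[str, list[dict[str, Any]]]


def build_user_content(
    prompt: str,
    image_bytes: bytes,
    mime: str,
    decide_mode: Callable[[], str],
    describe_image: Callable[[bytes], str],
) -> Content:
    """Return native image parts when routing says so, else a text-only prompt."""
    try:
        mode = decide_mode()  # assumed to return "native" or "text"
    except Exception:
        mode = "text"  # any routing failure falls back to the text pipeline

    if mode == "native":
        encoded = base64.b64encode(image_bytes).decode("ascii")
        return [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ]

    # Text pipeline: the model only sees a description, never the pixels.
    return f"{prompt}\n\n[Image description: {describe_image(image_bytes)}]"
```

Note that the result is either a string or a list of parts, so whatever consumes the user message has to accept both shapes.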
🔎 Lint report:
| Rule | Count |
|---|---|
| invalid-argument-type | 2 |
First entries
cli.py:14490: [invalid-argument-type] invalid-argument-type: Argument to bound method `AIAgent.run_conversation` is incorrect: Expected `str`, found `Any | list[dict[str, Any]] | str`
cli.py:14474: [invalid-argument-type] invalid-argument-type: Argument to bound method `HermesCLI._resolve_turn_agent_config` is incorrect: Expected `str`, found `Any | list[dict[str, Any]] | str`
✅ Fixed issues: none
Unchanged: 4741 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
Salvages and completes #17936 by @CNSeniorious000.
Summary
Setting `model.supports_vision: true` on a custom-provider model now routes attached images natively (as `image_url` parts the model sees as pixels) end-to-end. Single config knob: no need to also pin `agent.image_input_mode: native`.

Motivating case: Qwen3.6-35B-A3B served by local llama.cpp via `provider: custom`. The model is image-capable but absent from models.dev, so Hermes was running every attached image through `vision_analyze` first and feeding the main model a lossy text description.

Changes

- `agent/image_routing.py`: extract `_supports_vision_override()` resolver (top-level shortcut → named-provider per-model → models.dev fallback), wire it into `_lookup_supports_vision()` so auto-mode routing respects the override. Strict YAML bool coercion (recognises `true/false/yes/no/on/off/1/0`; `bool("false") == True` no longer leaks through). Handles named-custom-provider runtime/config disambiguation (`self.provider == "custom"` vs `cfg.model.provider == "my-vllm"`).
- `run_agent.py`: refactor `_model_supports_vision` to call the shared helper (single source of truth for the strip path and the routing path).
- `cli.py`: quiet-mode `-Q -q --image` path now consults `decide_image_input_mode()` instead of unconditionally calling the text pipeline (mirrors the interactive path).
- `scripts/release.py`: AUTHOR_MAP entry for @CNSeniorious000.
- `tests/agent/test_image_routing.py`: 26 new tests across `TestCoerceCapabilityBool`, `TestSupportsVisionOverride`, `TestLookupSupportsVisionOverride`, `TestAutoModeRespectsOverride`.

Validation

| Scenario | Baseline (no override) | With `supports_vision: true` |
|---|---|---|
| `chat -Q -q --image` (haiku-4.5 via OR-as-custom, 64x64 red PNG) | `vision_analyze` called (8 log lines), 5.8s, text-pipeline reply | `vision_analyze` NOT called, 3.9s, native `image_url` → "red" |

Credit
@CNSeniorious000 wrote the strip-path fix (commits 1 & 2). His PR body flagged the routing-side gap as out-of-scope and offered to extend it — taking him up on the offer with the remaining work. Authorship preserved per-commit via rebase merge.
Closes #17936.