feat(api-server): support multimodal image uploads via auxiliary vision by sunxyless · Pull Request #12329 · NousResearch/hermes-agent

sunxyless · 2026-04-19T00:10:44Z

What does this PR do?

Fixes an asymmetry between the CLI and the OpenAI-compatible API server:
the CLI handles image attachments end-to-end via _preprocess_images_with_vision
(cli.py:3666), but the API server silently drops them, so any multimodal
frontend (Open WebUI, LobeChat, LibreChat, …) loses images on the way to
the agent.

The API server currently receives OpenAI-style multi-part content like

[{"type": "text", "text": "..."},
 {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}]

_normalize_chat_content flattens these into a plain string and skips
image_url parts (see existing test test_image_url_parts_silently_skipped),
so images never reach the agent.

This PR adds _preprocess_message_images, a pre-flatten hook mounted on
both /v1/chat/completions and /v1/responses. For each message carrying
image parts, it:

Materializes any data: base64 URLs to a tempfile.
Calls vision_analyze_tool (auxiliary vision model — configured via
auxiliary.vision) to describe each image.
Inlines the description into the message content as plain text.

Because the agent ultimately sees only text, this works uniformly across
every provider backend (Codex Responses, Anthropic, OpenAI-compatible, …)
without touching run_agent.py or the provider-specific serialization
paths. It mirrors what the CLI already does, so the behavior is consistent
between the two entry points.

Also raises the default POST body limit from 1 MB → 25 MB and makes it
configurable via API_SERVER_MAX_REQUEST_MB, since a single base64-encoded
photo easily exceeds 1 MB.

Prior art / related PRs

I reviewed the open PRs in this area before submitting — this PR is
intentionally scoped narrowly:

feat(api_server): multimodal content support (images + audio) #4046 (multimodal images + audio): takes the same vision-tool
enrichment approach but invokes its processor after the existing
normalize loop, at which point each message's content is already a
string — so the multimodal handler is effectively dead code. This PR
fixes that by running the hook before normalization, adds tests,
and covers /v1/responses as well. Audio support is out of scope here
— happy to revisit in a follow-up.
feat(gateway): add inline image inputs to API server #5621 (native image passthrough): threads image parts all the way
through to the provider. This PR takes a smaller, provider-agnostic
route (text enrichment) that ships today without changes to
run_agent.py; the two approaches can coexist (native for providers
that support it, enrichment as fallback).
fix(api_server): support OpenAI content array format for images in chat completions #8253 (list-content preservation for Anthropic fallback): narrower
scope (Anthropic path only). This PR covers all backends via the
auxiliary vision pipeline.

Related Issue

No existing issue. Happy to file one retroactively if preferred.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

gateway/platforms/api_server.py
- New _preprocess_message_images() + helpers _split_content_parts and
  _materialize_data_url — walk message content, describe images via
  vision_analyze_tool, inline descriptions.
- Wire the hook into both /v1/chat/completions and /v1/responses
  ahead of the existing _normalize_chat_content step.
- MAX_REQUEST_BYTES now resolved via _resolve_max_request_bytes()
  (default 25 MB, override via API_SERVER_MAX_REQUEST_MB).
hermes_cli/config.py
- Register API_SERVER_MAX_REQUEST_MB in the env-var catalog.
Tests
- tests/gateway/test_api_server_image_preprocessing.py — 14 new tests
  (helpers, tempfile cleanup, data-URL handling, multi-image, vision
  failure/exception paths, missing vision_analyze_tool).
- tests/gateway/test_api_server_max_request_bytes.py — 8 new tests
  for the env-var resolver.
- tests/gateway/test_api_server_normalize.py — comment on existing
  test clarifying that _normalize_chat_content still drops image
  parts (callers now preprocess first).

How to Test

Unit tests:

pytest tests/gateway/test_api_server_image_preprocessing.py \
       tests/gateway/test_api_server_max_request_bytes.py \
       tests/gateway/test_api_server_normalize.py -v
# 42 passed in 0.87s

Full api_server suite (no regressions):

pytest tests/gateway/test_api_server*.py -q
# 220 passed

End-to-end with Open WebUI:

API_SERVER_ENABLED=true API_SERVER_KEY=... hermes gateway run
In Open WebUI (or any OpenAI-compatible frontend), attach an image
and ask a question about it.
The agent receives a text description of the image and responds
accordingly. Without this patch, the image part is silently dropped
and the model says "I don't see any image".

I verified the round-trip with Open WebUI → Codex gpt-5.4 +
auxiliary vision routed to Gemini Flash on macOS 14.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits
I searched for existing PRs — see "Prior art" above
My PR contains only changes related to this feature
I've run pytest tests/gateway/test_api_server*.py -q and all tests pass (220 passed)
I've added tests for my changes
I've tested on my platform: macOS 14 (Sonoma)

Documentation & Housekeeping

Docstrings updated for the new functions (the env var description lives in the config.py catalog)
New config key registered in hermes_cli/config.py
CONTRIBUTING.md / AGENTS.md — N/A
Cross-platform: uses tempfile + pathlib.Path; no Unix-only calls
Tool descriptions/schemas — N/A

Notes for reviewers

The preprocessor walks every message (not just the last user message) so
that image parts in assistant/tool messages from a continuing
conversation are also described — matches _normalize_chat_content's
own behavior of treating every message uniformly.
On per-image vision failure we inline an error note rather than dropping
the whole request, so a flaky vision backend doesn't 500 the API.
If tools.vision_tools fails to import at all, the message content is
left as-is and downstream normalization drops the images — same
behavior as pre-patch, just with a log warning.

Open WebUI and other OpenAI-compatible frontends send image attachments as multipart content with `image_url` parts. The existing content normalizer flattens multipart content to plain text and silently drops image parts, so images never reach the agent even though the CLI handles them fine. This change adds `_preprocess_message_images`, a pre-flatten hook run by both /v1/chat/completions and /v1/responses. For each message carrying image_url parts, it: - materializes any `data:` base64 URLs to a tempfile (Open WebUI format), - runs the auxiliary vision tool to describe each image, - inlines the description into the message text. Because the agent ultimately sees only text, this works uniformly across provider backends (Codex Responses, Anthropic, OpenAI-compatible, ...), mirroring the approach cli.py::_preprocess_images_with_vision already uses. Per-image errors become inline notes rather than failing the request. Also raises the default POST body limit from 1 MB to 25 MB and makes it configurable via API_SERVER_MAX_REQUEST_MB, since a single base64-encoded photo easily exceeds 1 MB. Tested on macOS 14 with Open WebUI + Codex (gpt-5.4) and auxiliary vision routed to Gemini Flash.

mxnstrexgl · 2026-04-19T02:02:22Z

🛡️ Automated Security Review – Issues Found

🔴 HIGH: SSRF Vulnerability

The function passes remote values directly to without validation:

Risk: Attackers can supply internal URLs (cloud metadata endpoints, internal services, file:// protocol) to probe the network or access sensitive metadata.

Recommendation: Validate URL schemes (allow only http/https), sanitize inputs, and consider domain whitelisting or SSRF protection.

🟡 MEDIUM: Information Disclosure

Exception messages are leaked to users:

Risk: Raw exception strings may contain file paths, internal configuration, or stack traces visible to API consumers.

Recommendation: Log full exception internally; return generic 'image processing failed' message to users.

🟡 MEDIUM: Insufficient File Validation

URLs are decoded and written to disk based on user-controlled MIME type header
No magic bytes validation to verify actual file content matches claimed type
No per-image size limits within the request

Recommendation: Validate file magic bytes after decode, enforce max decoded size per image.

🟢 POSITIVE: Good Practices

Temp file cleanup in blocks
Environment-configurable request size limits (API_SERVER_MAX_REQUEST_MB)
Comprehensive test coverage including failure cases

VERDICT: NEEDS CHANGES – Please address SSRF validation and info disclosure before merge.

Addresses the automated security review on NousResearch#12329: 1. SSRF hardening — reject non-http(s) URL schemes (file://, ftp://, …) and run all http(s) URLs through tools.url_safety.is_safe_url before calling the vision tool. Fails closed if the SSRF filter module is unavailable. data: URLs are allowed (they stay local). 2. Information disclosure — exception messages and upstream error strings from vision_analyze_tool are now logged internally only; the user-facing / agent-visible output collapses to a neutral "[The user attached an image but it could not be processed.]" so internal paths / stack detail never flow into the LLM prompt. 3. File content validation — data: URL payloads are now: - base64-decoded under try/except (invalid payloads rejected), - bounded by a 15 MiB per-image cap (API_SERVER_MAX_REQUEST_BYTES covered request-level sizing; this adds per-image), - sniffed via magic bytes (PNG/JPEG/GIF/WebP); claimed Content-Type is advisory only. Unrecognized payloads → ValueError. Tests: - Added sniffer + URL-safety unit tests (magic byte detection, scheme rejection, fail-closed behavior, SSRF target rejection). - Updated vision-failure / exception-raise tests to assert details appear in operator logs but NOT in user output. - Added file:// "what's in my /etc/passwd" test to confirm unsafe URLs never reach the vision tool. 220 existing api_server tests still pass; 41 tests on the new helpers (up from 22).

sunxyless · 2026-04-19T03:05:59Z

Thanks for the thorough review — all three findings addressed in ebe06b2:

🔴 SSRF

New _is_safe_image_url() gate at the entry of _preprocess_message_images:
- rejects non-http(s) schemes (file://, ftp://, gopher://, javascript:, ldap://)
- routes all http(s) URLs through tools.url_safety.is_safe_url (the existing SSRF filter that blocks loopback, link-local, private ranges, and cloud-metadata endpoints — same guard vision_analyze_tool itself uses)
- fails closed: if tools.url_safety can't be imported, any http(s) URL is refused rather than silently bypassing the filter
data: URLs are still accepted (they stay local) but their content is validated separately — see below.

🟡 Information Disclosure

Per-image failure now emits a single, neutral user-facing note:
[The user attached an image but it could not be processed.]
Exception repr(), upstream result.get("analysis")/result.get("error") strings, and exc_info=True stack traces are logged at WARNING level for operators but never interpolated into the content that gets fed back to the agent.
Covered by two new tests that assert the detail appears in caplog and is absent from messages[0]["content"].

🟡 Insufficient File Validation

New _materialize_data_url behavior:

base64.b64decode(payload, validate=False) wrapped in try/except → ValueError("invalid base64 …") on malformed input.
New _MAX_IMAGE_BYTES = 15 * 1024 * 1024: per-image cap on the decoded payload, enforced before the tempfile write. Complements the request-level API_SERVER_MAX_REQUEST_MB (defaults 25 MiB, configurable).
New _sniff_image_mime() magic-byte check (PNG / JPEG / GIF / WebP). The caller-supplied MIME in the data: header is advisory only — the sniffed type decides the tempfile suffix. Unrecognized payloads → ValueError("data: URL payload is not a recognized image format").
Added tests for: non-image payload rejection, invalid-base64 rejection, oversize payload rejection, and "claimed JPEG but actually PNG" → suffix follows sniff.

Test results

tests/gateway/test_api_server_image_preprocessing.py: 39 passed (up from 22, 17 new)
tests/gateway/test_api_server*.py: 239 passed (no regressions)

Also added a defensive test_fails_closed_when_url_safety_module_missing covering the import-failure path and a test_ssrf_target_rejected that monkeypatches is_safe_url to False to exercise the private-IP branch deterministically.

Let me know if anything else needs tightening.

Without an explicit ``client_max_size``, ``web.Application`` defaults to 1 MiB. ``body_limit_middleware`` was doing nothing useful — aiohttp silently truncated multimodal requests (base64-encoded images) at 1 MiB before our middleware could even inspect Content-Length, and the downstream ``request.json()`` raised, which the except handler collapsed into an opaque "Invalid JSON in request body" error. Pass ``client_max_size=MAX_REQUEST_BYTES`` so the middleware cap is actually authoritative. No behavior change for clients that stayed under 1 MiB; fixes silent rejection for anything larger. Discovered with Open WebUI + a 21 MB PNG. Smaller images happened to work because Open WebUI's client-side compression kept the final JSON body under 1 MiB. Excel uploads also worked because Open WebUI extracts the xlsx to text before forwarding to the chat completions endpoint — the body that reached api_server was tiny regardless of the original file size. Regression tests: - aiohttp's 1 MiB default is asserted (baseline). - Explicit ``client_max_size`` is honored (smoke-test the mechanism). - Source-level assertion that production path passes the matched limit. Also refactor two preprocessing tests to use a ``data:`` URL instead of ``https://example.com/...`` so they don't depend on ``tools.url_safety.is_safe_url`` being patchable under xdist parallel scheduling (intermittent failure when that module was imported by neighboring tests).

The per-image decoded-byte cap added in the security-review response was hardcoded to 15 MiB. That turns out to be too tight for legitimate use — a 21 MB PNG (e.g. a game map or high-res screenshot) decodes past the limit and the preprocessor rejects it, leaving the LLM to tell the user "I can't see the image." Raise the default to 50 MiB and surface it as ``API_SERVER_MAX_IMAGE_MB`` so operators can tune both bounds: - ``API_SERVER_MAX_REQUEST_MB`` caps the whole request (default 25) - ``API_SERVER_MAX_IMAGE_MB`` caps any single base64 image (default 50) Memory-bomb protection is preserved: operators who want the tighter 15 MiB behavior can set ``API_SERVER_MAX_IMAGE_MB=15``. Also rename the module constant from ``_MAX_IMAGE_BYTES`` (private) to ``MAX_IMAGE_BYTES`` to mirror ``MAX_REQUEST_BYTES`` and make the env knob easier to introspect from tests. Test additions parallel the ``_resolve_max_request_bytes`` suite: default, env override (integer / fractional), empty/whitespace, invalid, non-positive, module-constant sanity.

sunxyless · 2026-04-21T13:18:41Z

Superseded by #12969 (merged) + #8328 (body-size follow-up, still open). Closing.

#12969's design — passing canonical multimodal structure straight through to the provider — is cleaner than my auxiliary-vision preprocessor approach, which would have been a pure fallback path for non-vision providers. Since Codex Responses + Anthropic both handle the canonical format natively via the converter in run_agent.py, there's no remaining use case for the preprocessor.

The parts of this PR that don't overlap with #12969 / #8328 are the per-image cap and the security-review hardening (SSRF filter, magic-byte validation, exception-message non-leakage), but those only made sense wrapped around the preprocessor. Happy to revive any of them as a separate, scoped PR if there's interest.

Thanks for the quick rewrite on #12969 and for the nod to the other PRs in the credits.

sunxyless added 2 commits April 21, 2026 20:46

sunxyless closed this Apr 21, 2026

sunxyless mentioned this pull request Apr 21, 2026

fix(gateway): make api_server body-size cap configurable, raise default to 100 MiB #8328

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(api-server): support multimodal image uploads via auxiliary vision#12329

feat(api-server): support multimodal image uploads via auxiliary vision#12329
sunxyless wants to merge 4 commits into
NousResearch:mainfrom
sunxyless:feat/api-server-image-uploads

sunxyless commented Apr 19, 2026

Uh oh!

mxnstrexgl commented Apr 19, 2026

Uh oh!

sunxyless commented Apr 19, 2026

Uh oh!

sunxyless commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sunxyless commented Apr 19, 2026

What does this PR do?

Prior art / related PRs

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Notes for reviewers

Uh oh!

mxnstrexgl commented Apr 19, 2026

🛡️ Automated Security Review – Issues Found

🔴 HIGH: SSRF Vulnerability

🟡 MEDIUM: Information Disclosure

🟡 MEDIUM: Insufficient File Validation

🟢 POSITIVE: Good Practices

Uh oh!

sunxyless commented Apr 19, 2026

🔴 SSRF

🟡 Information Disclosure

🟡 Insufficient File Validation

Test results

Uh oh!

sunxyless commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants