feat(api-server): support multimodal image uploads via auxiliary vision#12329
sunxyless wants to merge 4 commits into
Open WebUI and other OpenAI-compatible frontends send image attachments as multipart content with `image_url` parts. The existing content normalizer flattens multipart content to plain text and silently drops image parts, so images never reach the agent even though the CLI handles them fine.

This change adds `_preprocess_message_images`, a pre-flatten hook run by both `/v1/chat/completions` and `/v1/responses`. For each message carrying `image_url` parts, it:

- materializes any `data:` base64 URLs to a tempfile (Open WebUI format),
- runs the auxiliary vision tool to describe each image,
- inlines the description into the message text.

Because the agent ultimately sees only text, this works uniformly across provider backends (Codex Responses, Anthropic, OpenAI-compatible, ...), mirroring the approach `cli.py::_preprocess_images_with_vision` already uses. Per-image errors become inline notes rather than failing the request.

Also raises the default POST body limit from 1 MB to 25 MB and makes it configurable via `API_SERVER_MAX_REQUEST_MB`, since a single base64-encoded photo easily exceeds 1 MB.

Tested on macOS 14 with Open WebUI + Codex (gpt-5.4) and auxiliary vision routed to Gemini Flash.
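The "materialize a `data:` URL to a tempfile" step could look roughly like the sketch below. This is an illustration only, not the PR's `_materialize_data_url`: the function name `materialize_data_url`, the suffix table, and the decision to reject non-base64 `data:` URLs are all assumptions made for the example.

```python
import base64
import os
import tempfile
from pathlib import Path

# Hypothetical suffix table; the real helper may choose suffixes differently.
_SUFFIXES = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def materialize_data_url(url: str) -> Path:
    """Decode a ``data:<mime>;base64,<payload>`` URL into a temporary file.

    The caller owns the returned path and is responsible for deleting it.
    """
    if not url.startswith("data:"):
        raise ValueError("not a data: URL")
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("only base64 data: URLs are handled in this sketch")
    # validate=True makes non-alphabet characters raise binascii.Error,
    # which is a ValueError subclass.
    raw = base64.b64decode(payload, validate=True)
    mime = header.split(";", 1)[0].lower()
    fd, name = tempfile.mkstemp(suffix=_SUFFIXES.get(mime, ".bin"))
    with os.fdopen(fd, "wb") as fh:
        fh.write(raw)
    return Path(name)
```

A caller would typically wrap the vision call in `try`/`finally` and call `path.unlink(missing_ok=True)` so the tempfile is removed even when the description step fails.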
🛡️ Automated Security Review – Issues Found

🔴 HIGH: SSRF Vulnerability
The function passes remote values directly to without validation:
Risk: Attackers can supply internal URLs (cloud metadata endpoints, internal services, file:// protocol) to probe the network or access sensitive metadata.
Recommendation: Validate URL schemes (allow only http/https), sanitize inputs, and consider domain whitelisting or SSRF protection.

🟡 MEDIUM: Information Disclosure
Exception messages are leaked to users:
Risk: Raw exception strings may contain file paths, internal configuration, or stack traces visible to API consumers.
Recommendation: Log full exception internally; return generic 'image processing failed' message to users.

🟡 MEDIUM: Insufficient File Validation
Recommendation: Validate file magic bytes after decode, enforce max decoded size per image.

🟢 POSITIVE: Good Practices

VERDICT: NEEDS CHANGES – Please address SSRF validation and info disclosure before merge.
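For readers unfamiliar with the SSRF recommendation, a minimal scheme-and-address check might look like the following. It is a sketch under stated assumptions: `is_allowed_image_url` is a hypothetical name, and the check only covers literal IP addresses and `localhost`. A production filter also has to vet the addresses a DNS name resolves to, which this example deliberately leaves out.

```python
import ipaddress
from urllib.parse import urlsplit


def is_allowed_image_url(url: str) -> bool:
    """Return True only for http(s) URLs that do not obviously target internal hosts."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL (for example, a broken IPv6 literal)
    if parts.scheme not in ("http", "https"):
        return False  # rejects file://, ftp://, and anything else
    host = parts.hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is a DNS name. This sketch lets it through;
        # a real filter must also check the resolved addresses.
        return True
    # is_global is False for loopback, private, and link-local ranges,
    # which includes the 169.254.169.254 cloud metadata address.
    return ip.is_global
```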
Addresses the automated security review on NousResearch#12329:

1. SSRF hardening: reject non-http(s) URL schemes (file://, ftp://, …) and run all http(s) URLs through `tools.url_safety.is_safe_url` before calling the vision tool. Fails closed if the SSRF filter module is unavailable. `data:` URLs are allowed (they stay local).
2. Information disclosure: exception messages and upstream error strings from `vision_analyze_tool` are now logged internally only; the user-facing / agent-visible output collapses to a neutral "[The user attached an image but it could not be processed.]" so internal paths / stack detail never flow into the LLM prompt.
3. File content validation: `data:` URL payloads are now:
   - base64-decoded under try/except (invalid payloads rejected),
   - bounded by a 15 MiB per-image cap (API_SERVER_MAX_REQUEST_BYTES covered request-level sizing; this adds per-image),
   - sniffed via magic bytes (PNG/JPEG/GIF/WebP); claimed Content-Type is advisory only. Unrecognized payloads → ValueError.

Tests:
- Added sniffer + URL-safety unit tests (magic byte detection, scheme rejection, fail-closed behavior, SSRF target rejection).
- Updated vision-failure / exception-raise tests to assert details appear in operator logs but NOT in user output.
- Added file:// "what's in my /etc/passwd" test to confirm unsafe URLs never reach the vision tool.

220 existing api_server tests still pass; 41 tests on the new helpers (up from 22).
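The file-content validation in item 3 can be illustrated with a short sketch. The names `sniff_image_type` and `decode_image_payload` are hypothetical and not taken from the PR; only the magic-byte signatures, which are standard for these formats, and the 15 MiB figure quoted above are carried over.

```python
import base64
from typing import Optional, Tuple

MAX_IMAGE_BYTES = 15 * 1024 * 1024  # per-image cap quoted in the commit message


def sniff_image_type(data: bytes) -> Optional[str]:
    """Identify PNG/JPEG/GIF/WebP from leading magic bytes, else None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def decode_image_payload(b64_payload: str,
                         max_bytes: int = MAX_IMAGE_BYTES) -> Tuple[str, bytes]:
    """Decode a base64 payload, enforce the size cap, and sniff the real type."""
    # Cheap pre-check before allocating: 4 base64 chars carry at most 3 bytes,
    # so a payload longer than this bound must decode past the cap.
    if len(b64_payload) > (max_bytes // 3 + 1) * 4:
        raise ValueError("image exceeds per-image size cap")
    raw = base64.b64decode(b64_payload, validate=True)  # ValueError on bad input
    if len(raw) > max_bytes:
        raise ValueError("image exceeds per-image size cap")
    kind = sniff_image_type(raw)
    if kind is None:
        raise ValueError("payload is not a recognized image format")
    return kind, raw
```

The claimed MIME type from the `data:` header is never consulted here, which matches the "advisory only" stance described above.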
Thanks for the thorough review. All three findings are addressed in ebe06b2:

🔴 SSRF

🟡 Information Disclosure

🟡 Insufficient File Validation
New

Test results

Also added a defensive

Let me know if anything else needs tightening.
Without an explicit ``client_max_size``, ``web.Application`` defaults to 1 MiB. ``body_limit_middleware`` was doing nothing useful: aiohttp silently truncated multimodal requests (base64-encoded images) at 1 MiB before our middleware could even inspect Content-Length, and the downstream ``request.json()`` raised, which the except handler collapsed into an opaque "Invalid JSON in request body" error.

Pass ``client_max_size=MAX_REQUEST_BYTES`` so the middleware cap is actually authoritative. No behavior change for clients that stayed under 1 MiB; fixes silent rejection for anything larger.

Discovered with Open WebUI + a 21 MB PNG. Smaller images happened to work because Open WebUI's client-side compression kept the final JSON body under 1 MiB. Excel uploads also worked because Open WebUI extracts the xlsx to text before forwarding to the chat completions endpoint, so the body that reached api_server was tiny regardless of the original file size.

Regression tests:
- aiohttp's 1 MiB default is asserted (baseline).
- Explicit ``client_max_size`` is honored (smoke-test the mechanism).
- Source-level assertion that production path passes the matched limit.

Also refactor two preprocessing tests to use a ``data:`` URL instead of ``https://example.com/...`` so they don't depend on ``tools.url_safety.is_safe_url`` being patchable under xdist parallel scheduling (intermittent failure when that module was imported by neighboring tests).
The per-image decoded-byte cap added in the security-review response was hardcoded to 15 MiB. That turns out to be too tight for legitimate use: a 21 MB PNG (e.g. a game map or high-res screenshot) decodes past the limit and the preprocessor rejects it, leaving the LLM to tell the user "I can't see the image."

Raise the default to 50 MiB and surface it as ``API_SERVER_MAX_IMAGE_MB`` so operators can tune both bounds:
- ``API_SERVER_MAX_REQUEST_MB`` caps the whole request (default 25)
- ``API_SERVER_MAX_IMAGE_MB`` caps any single base64 image (default 50)

Memory-bomb protection is preserved: operators who want the tighter 15 MiB behavior can set ``API_SERVER_MAX_IMAGE_MB=15``.

Also rename the module constant from ``_MAX_IMAGE_BYTES`` (private) to ``MAX_IMAGE_BYTES`` to mirror ``MAX_REQUEST_BYTES`` and make the env knob easier to introspect from tests.

Test additions parallel the ``_resolve_max_request_bytes`` suite: default, env override (integer / fractional), empty/whitespace, invalid, non-positive, module-constant sanity.
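A resolver covering the cases listed in that test suite might be shaped like the sketch below. The name `resolve_limit_bytes` and its signature are hypothetical. Two behaviors are assumptions rather than facts from the PR: falling back to the default on invalid or non-positive values, and treating "MB" as 1024 × 1024 bytes.

```python
import math
import os
from typing import Mapping, Optional


def resolve_limit_bytes(var: str, default_mb: float,
                        env: Optional[Mapping[str, str]] = None) -> int:
    """Read a size limit in megabytes from the environment and return bytes.

    Unset, empty, non-numeric, non-finite, and non-positive values all fall
    back to the default (an assumption made for this sketch).
    """
    env = os.environ if env is None else env
    default = int(default_mb * 1024 * 1024)
    raw = (env.get(var) or "").strip()
    if not raw:
        return default
    try:
        mb = float(raw)  # accepts both "50" and "1.5"
    except ValueError:
        return default
    if not math.isfinite(mb) or mb <= 0:
        return default
    return int(mb * 1024 * 1024)


# Both knobs from the commit message, resolved the same way.
MAX_REQUEST_BYTES = resolve_limit_bytes("API_SERVER_MAX_REQUEST_MB", 25)
MAX_IMAGE_BYTES = resolve_limit_bytes("API_SERVER_MAX_IMAGE_MB", 50)
```

Passing the environment as a mapping keeps the function testable without monkeypatching `os.environ`.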
Superseded by #12969 (merged) + #8328 (body-size follow-up, still open). Closing.

#12969's design (passing canonical multimodal structure straight through to the provider) is cleaner than my auxiliary-vision preprocessor approach, which would have been a pure fallback path for non-vision providers. Since Codex Responses + Anthropic both handle the canonical format natively via the converter in

The parts of this PR that don't overlap with #12969 / #8328 are the per-image cap and the security-review hardening (SSRF filter, magic-byte validation, exception-message non-leakage), but those only made sense wrapped around the preprocessor. Happy to revive any of them as a separate, scoped PR if there's interest.

Thanks for the quick rewrite on #12969 and for the nod to the other PRs in the credits.
What does this PR do?
Fixes an asymmetry between the CLI and the OpenAI-compatible API server: the CLI handles image attachments end-to-end via `_preprocess_images_with_vision` (`cli.py:3666`), but the API server silently drops them, so any multimodal frontend (Open WebUI, LobeChat, LibreChat, …) loses images on the way to the agent.

The API server currently receives OpenAI-style multi-part content like

    [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}]

`_normalize_chat_content` flattens these into a plain string and skips `image_url` parts (see existing test `test_image_url_parts_silently_skipped`), so images never reach the agent.
This PR adds `_preprocess_message_images`, a pre-flatten hook mounted on both `/v1/chat/completions` and `/v1/responses`. For each message carrying image parts, it:

- Materializes any `data:` base64 URLs to a tempfile.
- Runs `vision_analyze_tool` (auxiliary vision model, configured via `auxiliary.vision`) to describe each image.
- Inlines the description into the message text.

Because the agent ultimately sees only text, this works uniformly across every provider backend (Codex Responses, Anthropic, OpenAI-compatible, …) without touching `run_agent.py` or the provider-specific serialization paths. It mirrors what the CLI already does, so the behavior is consistent between the two entry points.
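The pre-flatten idea can be sketched in a few lines. This is not the PR's code: `split_content_parts` and `preprocess_message_images` are simplified stand-ins, the `describe` callable stands in for the auxiliary vision call, and the `[Image description: ...]` wrapper is a made-up format. The failure note is the neutral string quoted earlier in this thread.

```python
from typing import Any, Callable, Dict, List, Tuple

FAILURE_NOTE = "[The user attached an image but it could not be processed.]"


def split_content_parts(content: List[Any]) -> Tuple[List[str], List[str]]:
    """Separate OpenAI-style content parts into text chunks and image URLs."""
    texts: List[str] = []
    urls: List[str] = []
    for part in content:
        if not isinstance(part, dict):
            continue
        if part.get("type") == "text":
            texts.append(str(part.get("text", "")))
        elif part.get("type") == "image_url":
            image = part.get("image_url") or {}
            url = image.get("url") if isinstance(image, dict) else image
            if url:
                urls.append(url)
    return texts, urls


def preprocess_message_images(messages: List[Dict[str, Any]],
                              describe: Callable[[str], str]) -> List[Dict[str, Any]]:
    """Replace image parts with text descriptions before content is flattened."""
    out = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            out.append(msg)  # plain-string content needs no work
            continue
        texts, urls = split_content_parts(content)
        if not urls:
            out.append(msg)  # nothing to describe; leave for the normalizer
            continue
        notes = []
        for url in urls:
            try:
                notes.append(f"[Image description: {describe(url)}]")
            except Exception:
                # A per-image failure becomes an inline note, not a failed request.
                notes.append(FAILURE_NOTE)
        out.append({**msg, "content": "\n".join(texts + notes)})
    return out
```

Running the hook before normalization is the whole point: once content has been flattened to a string, there are no `image_url` parts left to describe.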
Also raises the default POST body limit from 1 MB to 25 MB and makes it configurable via `API_SERVER_MAX_REQUEST_MB`, since a single base64-encoded photo easily exceeds 1 MB.
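To see why 1 MB is so easy to exceed, recall that base64 emits 4 output characters for every 3 input bytes. The arithmetic below is a worked illustration with hypothetical helper names; it ignores the JSON envelope around the payload, which only makes the real body larger.

```python
import base64


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of data."""
    return 4 * ((n_bytes + 2) // 3)


def max_raw_bytes(body_limit: int) -> int:
    """Largest raw payload whose base64 form fits in body_limit characters."""
    return (body_limit // 4) * 3


MiB = 1024 * 1024

# An 800 KiB photo already overflows a 1 MiB body once encoded.
print(base64_len(800 * 1024))         # 1092268
print(base64_len(800 * 1024) > MiB)   # True

# A 25 MiB body leaves room for roughly 18.75 MiB of raw image data.
print(max_raw_bytes(25 * MiB) / MiB)  # 18.75

# Sanity check against the standard library encoder.
print(len(base64.b64encode(b"\x00" * 1000)) == base64_len(1000))  # True
```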
Prior art / related PRs
I reviewed the open PRs in this area before submitting; this PR is intentionally scoped narrowly:

enrichment approach but invokes its processor after the existing normalize loop, at which point each message's `content` is already a string, so the multimodal handler is effectively dead code. This PR fixes that by running the hook before normalization, adds tests, and covers `/v1/responses` as well. Audio support is out of scope here; happy to revisit in a follow-up.

through to the provider. This PR takes a smaller, provider-agnostic route (text enrichment) that ships today without changes to `run_agent.py`; the two approaches can coexist (native for providers that support it, enrichment as fallback).

scope (Anthropic path only). This PR covers all backends via the auxiliary vision pipeline.
Related Issue
No existing issue. Happy to file one retroactively if preferred.
Type of Change
Changes Made
- `gateway/platforms/api_server.py`
  - `_preprocess_message_images()` + helpers `_split_content_parts` and `_materialize_data_url`: walk message content, describe images via `vision_analyze_tool`, inline descriptions.
  - `/v1/chat/completions` and `/v1/responses` ahead of the existing `_normalize_chat_content` step.
  - `MAX_REQUEST_BYTES` now resolved via `_resolve_max_request_bytes()` (default 25 MB, override via `API_SERVER_MAX_REQUEST_MB`).
- `hermes_cli/config.py`
  - `API_SERVER_MAX_REQUEST_MB` in the env-var catalog.
- `tests/gateway/test_api_server_image_preprocessing.py`: 14 new tests (helpers, tempfile cleanup, data-URL handling, multi-image, vision failure/exception paths, missing `vision_analyze_tool`).
- `tests/gateway/test_api_server_max_request_bytes.py`: 8 new tests for the env-var resolver.
- `tests/gateway/test_api_server_normalize.py`: comment on existing test clarifying that `_normalize_chat_content` still drops image parts (callers now preprocess first).
How to Test
Unit tests:

    pytest tests/gateway/test_api_server_image_preprocessing.py \
        tests/gateway/test_api_server_max_request_bytes.py \
        tests/gateway/test_api_server_normalize.py -v
    # 42 passed in 0.87s

Full api_server suite (no regressions):

End-to-end with Open WebUI:

    API_SERVER_ENABLED=true API_SERVER_KEY=... hermes gateway run

and ask a question about it.
accordingly. Without this patch, the image part is silently dropped and the model says "I don't see any image".

I verified the round-trip with Open WebUI → Codex `gpt-5.4` + auxiliary vision routed to Gemini Flash on macOS 14.
Checklist
Code
`pytest tests/gateway/test_api_server*.py -q` and all tests pass (220 passed)

Documentation & Housekeeping
`config.py` catalog)
`hermes_cli/config.py`
`CONTRIBUTING.md` / `AGENTS.md`: N/A
`tempfile` + `pathlib.Path`; no Unix-only calls

Notes for reviewers
that image parts in assistant/tool messages from a continuing conversation are also described; this matches `_normalize_chat_content`'s own behavior of treating every message uniformly.

the whole request, so a flaky vision backend doesn't 500 the API.

`tools.vision_tools` fails to import at all, the message content is left as-is and downstream normalization drops the images: same behavior as pre-patch, just with a log warning.
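The import-failure fallback in the last note could be structured along these lines. This is a sketch only: `load_vision_tool` is a hypothetical helper, and the module and attribute names are passed as parameters so the example runs without the project's `tools` package installed.

```python
import importlib
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def load_vision_tool(module_name: str = "tools.vision_tools",
                     attr: str = "vision_analyze_tool") -> Optional[Callable[..., Any]]:
    """Return the vision callable, or None when it cannot be imported.

    Returning None lets the caller skip preprocessing and leave the message
    content untouched, so behavior degrades to the pre-patch path.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        logger.warning("vision tool module %s unavailable; images will be dropped",
                       module_name)
        return None
    tool = getattr(module, attr, None)
    if tool is None:
        logger.warning("%s has no attribute %s; images will be dropped",
                       module_name, attr)
    return tool
```

A caller would check the result once per request and return the messages unchanged when it is `None`.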