Skip to content

feat(api-server): support multimodal image uploads via auxiliary vision#12329

Closed
sunxyless wants to merge 4 commits into
NousResearch:mainfrom
sunxyless:feat/api-server-image-uploads
Closed

feat(api-server): support multimodal image uploads via auxiliary vision#12329
sunxyless wants to merge 4 commits into
NousResearch:mainfrom
sunxyless:feat/api-server-image-uploads

Conversation

@sunxyless

Copy link
Copy Markdown

What does this PR do?

Fixes an asymmetry between the CLI and the OpenAI-compatible API server:
the CLI handles image attachments end-to-end via _preprocess_images_with_vision
(cli.py:3666), but the API server silently drops them, so any multimodal
frontend (Open WebUI, LobeChat, LibreChat, …) loses images on the way to
the agent.

The API server currently receives OpenAI-style multi-part content like

[{"type": "text", "text": "..."},
 {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}]

_normalize_chat_content flattens these into a plain string and skips
image_url parts (see existing test test_image_url_parts_silently_skipped),
so images never reach the agent.

This PR adds _preprocess_message_images, a pre-flatten hook mounted on
both /v1/chat/completions and /v1/responses. For each message carrying
image parts, it:

  1. Materializes any data: base64 URLs to a tempfile.
  2. Calls vision_analyze_tool (auxiliary vision model — configured via
    auxiliary.vision) to describe each image.
  3. Inlines the description into the message content as plain text.

Because the agent ultimately sees only text, this works uniformly across
every provider backend (Codex Responses, Anthropic, OpenAI-compatible, …)
without touching run_agent.py or the provider-specific serialization
paths. It mirrors what the CLI already does, so the behavior is consistent
between the two entry points.

Also raises the default POST body limit from 1 MB → 25 MB and makes it
configurable via API_SERVER_MAX_REQUEST_MB, since a single base64-encoded
photo easily exceeds 1 MB.

Prior art / related PRs

I reviewed the open PRs in this area before submitting — this PR is
intentionally scoped narrowly:

  • feat(api_server): multimodal content support (images + audio) #4046 (multimodal images + audio): takes the same vision-tool
    enrichment approach but invokes its processor after the existing
    normalize loop, at which point each message's content is already a
    string — so the multimodal handler is effectively dead code. This PR
    fixes that by running the hook before normalization, adds tests,
    and covers /v1/responses as well. Audio support is out of scope here
    — happy to revisit in a follow-up.
  • feat(gateway): add inline image inputs to API server #5621 (native image passthrough): threads image parts all the way
    through to the provider. This PR takes a smaller, provider-agnostic
    route (text enrichment) that ships today without changes to
    run_agent.py; the two approaches can coexist (native for providers
    that support it, enrichment as fallback).
  • fix(api_server): support OpenAI content array format for images in chat completions #8253 (list-content preservation for Anthropic fallback): narrower
    scope (Anthropic path only). This PR covers all backends via the
    auxiliary vision pipeline.

Related Issue

No existing issue. Happy to file one retroactively if preferred.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/api_server.py
    • New _preprocess_message_images() + helpers _split_content_parts and
      _materialize_data_url — walk message content, describe images via
      vision_analyze_tool, inline descriptions.
    • Wire the hook into both /v1/chat/completions and /v1/responses
      ahead of the existing _normalize_chat_content step.
    • MAX_REQUEST_BYTES now resolved via _resolve_max_request_bytes()
      (default 25 MB, override via API_SERVER_MAX_REQUEST_MB).
  • hermes_cli/config.py
    • Register API_SERVER_MAX_REQUEST_MB in the env-var catalog.
  • Tests
    • tests/gateway/test_api_server_image_preprocessing.py — 14 new tests
      (helpers, tempfile cleanup, data-URL handling, multi-image, vision
      failure/exception paths, missing vision_analyze_tool).
    • tests/gateway/test_api_server_max_request_bytes.py — 8 new tests
      for the env-var resolver.
    • tests/gateway/test_api_server_normalize.py — comment on existing
      test clarifying that _normalize_chat_content still drops image
      parts (callers now preprocess first).

How to Test

Unit tests:

pytest tests/gateway/test_api_server_image_preprocessing.py \
       tests/gateway/test_api_server_max_request_bytes.py \
       tests/gateway/test_api_server_normalize.py -v
# 42 passed in 0.87s

Full api_server suite (no regressions):

pytest tests/gateway/test_api_server*.py -q
# 220 passed

End-to-end with Open WebUI:

  1. API_SERVER_ENABLED=true API_SERVER_KEY=... hermes gateway run
  2. In Open WebUI (or any OpenAI-compatible frontend), attach an image
    and ask a question about it.
  3. The agent receives a text description of the image and responds
    accordingly. Without this patch, the image part is silently dropped
    and the model says "I don't see any image".

I verified the round-trip with Open WebUI → Codex gpt-5.4 +
auxiliary vision routed to Gemini Flash on macOS 14.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs — see "Prior art" above
  • My PR contains only changes related to this feature
  • I've run pytest tests/gateway/test_api_server*.py -q and all tests pass (220 passed)
  • I've added tests for my changes
  • I've tested on my platform: macOS 14 (Sonoma)

Documentation & Housekeeping

  • Docstrings updated for the new functions (the env var description lives in the config.py catalog)
  • New config key registered in hermes_cli/config.py
  • CONTRIBUTING.md / AGENTS.md — N/A
  • Cross-platform: uses tempfile + pathlib.Path; no Unix-only calls
  • Tool descriptions/schemas — N/A

Notes for reviewers

  • The preprocessor walks every message (not just the last user message) so
    that image parts in assistant/tool messages from a continuing
    conversation are also described — matches _normalize_chat_content's
    own behavior of treating every message uniformly.
  • On per-image vision failure we inline an error note rather than dropping
    the whole request, so a flaky vision backend doesn't 500 the API.
  • If tools.vision_tools fails to import at all, the message content is
    left as-is and downstream normalization drops the images — same
    behavior as pre-patch, just with a log warning.

Open WebUI and other OpenAI-compatible frontends send image attachments as
multipart content with `image_url` parts. The existing content normalizer
flattens multipart content to plain text and silently drops image parts, so
images never reach the agent even though the CLI handles them fine.

This change adds `_preprocess_message_images`, a pre-flatten hook run by
both /v1/chat/completions and /v1/responses. For each message carrying
image_url parts, it:

- materializes any `data:` base64 URLs to a tempfile (Open WebUI format),
- runs the auxiliary vision tool to describe each image,
- inlines the description into the message text.

Because the agent ultimately sees only text, this works uniformly across
provider backends (Codex Responses, Anthropic, OpenAI-compatible, ...),
mirroring the approach cli.py::_preprocess_images_with_vision already uses.
Per-image errors become inline notes rather than failing the request.

Also raises the default POST body limit from 1 MB to 25 MB and makes it
configurable via API_SERVER_MAX_REQUEST_MB, since a single base64-encoded
photo easily exceeds 1 MB.

Tested on macOS 14 with Open WebUI + Codex (gpt-5.4) and auxiliary vision
routed to Gemini Flash.
@mxnstrexgl

Copy link
Copy Markdown

🛡️ Automated Security Review – Issues Found

🔴 HIGH: SSRF Vulnerability

The function passes remote values directly to without validation:

Risk: Attackers can supply internal URLs (cloud metadata endpoints, internal services, file:// protocol) to probe the network or access sensitive metadata.

Recommendation: Validate URL schemes (allow only http/https), sanitize inputs, and consider domain whitelisting or SSRF protection.

🟡 MEDIUM: Information Disclosure

Exception messages are leaked to users:

Risk: Raw exception strings may contain file paths, internal configuration, or stack traces visible to API consumers.

Recommendation: Log full exception internally; return generic 'image processing failed' message to users.

🟡 MEDIUM: Insufficient File Validation

  • URLs are decoded and written to disk based on user-controlled MIME type header
  • No magic bytes validation to verify actual file content matches claimed type
  • No per-image size limits within the request

Recommendation: Validate file magic bytes after decode, enforce max decoded size per image.

🟢 POSITIVE: Good Practices

  • Temp file cleanup in blocks
  • Environment-configurable request size limits (API_SERVER_MAX_REQUEST_MB)
  • Comprehensive test coverage including failure cases

VERDICT: NEEDS CHANGES – Please address SSRF validation and info disclosure before merge.

Addresses the automated security review on NousResearch#12329:

1. SSRF hardening — reject non-http(s) URL schemes (file://, ftp://, …)
   and run all http(s) URLs through tools.url_safety.is_safe_url before
   calling the vision tool.  Fails closed if the SSRF filter module is
   unavailable.  data: URLs are allowed (they stay local).

2. Information disclosure — exception messages and upstream error
   strings from vision_analyze_tool are now logged internally only;
   the user-facing / agent-visible output collapses to a neutral
   "[The user attached an image but it could not be processed.]"
   so internal paths / stack detail never flow into the LLM prompt.

3. File content validation — data: URL payloads are now:
   - base64-decoded under try/except (invalid payloads rejected),
   - bounded by a 15 MiB per-image cap (API_SERVER_MAX_REQUEST_BYTES
     covered request-level sizing; this adds per-image),
   - sniffed via magic bytes (PNG/JPEG/GIF/WebP); claimed Content-Type
     is advisory only.  Unrecognized payloads → ValueError.

Tests:
- Added sniffer + URL-safety unit tests (magic byte detection,
  scheme rejection, fail-closed behavior, SSRF target rejection).
- Updated vision-failure / exception-raise tests to assert details
  appear in operator logs but NOT in user output.
- Added file:// "what's in my /etc/passwd" test to confirm unsafe
  URLs never reach the vision tool.

220 existing api_server tests still pass; 41 tests on the new helpers
(up from 22).
@sunxyless

Copy link
Copy Markdown
Author

Thanks for the thorough review — all three findings addressed in ebe06b2:

🔴 SSRF

  • New _is_safe_image_url() gate at the entry of _preprocess_message_images:
    • rejects non-http(s) schemes (file://, ftp://, gopher://, javascript:, ldap://)
    • routes all http(s) URLs through tools.url_safety.is_safe_url (the existing SSRF filter that blocks loopback, link-local, private ranges, and cloud-metadata endpoints — same guard vision_analyze_tool itself uses)
    • fails closed: if tools.url_safety can't be imported, any http(s) URL is refused rather than silently bypassing the filter
  • data: URLs are still accepted (they stay local) but their content is validated separately — see below.

🟡 Information Disclosure

  • Per-image failure now emits a single, neutral user-facing note:
    [The user attached an image but it could not be processed.]
  • Exception repr(), upstream result.get("analysis")/result.get("error") strings, and exc_info=True stack traces are logged at WARNING level for operators but never interpolated into the content that gets fed back to the agent.
  • Covered by two new tests that assert the detail appears in caplog and is absent from messages[0]["content"].

🟡 Insufficient File Validation

New _materialize_data_url behavior:

  • base64.b64decode(payload, validate=False) wrapped in try/except → ValueError("invalid base64 …") on malformed input.
  • New _MAX_IMAGE_BYTES = 15 * 1024 * 1024: per-image cap on the decoded payload, enforced before the tempfile write. Complements the request-level API_SERVER_MAX_REQUEST_MB (defaults 25 MiB, configurable).
  • New _sniff_image_mime() magic-byte check (PNG / JPEG / GIF / WebP). The caller-supplied MIME in the data: header is advisory only — the sniffed type decides the tempfile suffix. Unrecognized payloads → ValueError("data: URL payload is not a recognized image format").
  • Added tests for: non-image payload rejection, invalid-base64 rejection, oversize payload rejection, and "claimed JPEG but actually PNG" → suffix follows sniff.

Test results

  • tests/gateway/test_api_server_image_preprocessing.py: 39 passed (up from 22, 17 new)
  • tests/gateway/test_api_server*.py: 239 passed (no regressions)

Also added a defensive test_fails_closed_when_url_safety_module_missing covering the import-failure path and a test_ssrf_target_rejected that monkeypatches is_safe_url to False to exercise the private-IP branch deterministically.

Let me know if anything else needs tightening.

Without an explicit ``client_max_size``, ``web.Application`` defaults to
1 MiB.  ``body_limit_middleware`` was doing nothing useful — aiohttp
silently truncated multimodal requests (base64-encoded images) at 1 MiB
before our middleware could even inspect Content-Length, and the
downstream ``request.json()`` raised, which the except handler
collapsed into an opaque "Invalid JSON in request body" error.

Pass ``client_max_size=MAX_REQUEST_BYTES`` so the middleware cap is
actually authoritative.  No behavior change for clients that stayed
under 1 MiB; fixes silent rejection for anything larger.

Discovered with Open WebUI + a 21 MB PNG.  Smaller images happened to
work because Open WebUI's client-side compression kept the final JSON
body under 1 MiB.  Excel uploads also worked because Open WebUI
extracts the xlsx to text before forwarding to the chat completions
endpoint — the body that reached api_server was tiny regardless of the
original file size.

Regression tests:
- aiohttp's 1 MiB default is asserted (baseline).
- Explicit ``client_max_size`` is honored (smoke-test the mechanism).
- Source-level assertion that production path passes the matched limit.

Also refactor two preprocessing tests to use a ``data:`` URL instead of
``https://example.com/...`` so they don't depend on
``tools.url_safety.is_safe_url`` being patchable under xdist parallel
scheduling (intermittent failure when that module was imported by
neighboring tests).
The per-image decoded-byte cap added in the security-review response was
hardcoded to 15 MiB.  That turns out to be too tight for legitimate
use — a 21 MB PNG (e.g. a game map or high-res screenshot) decodes past
the limit and the preprocessor rejects it, leaving the LLM to tell the
user "I can't see the image."

Raise the default to 50 MiB and surface it as
``API_SERVER_MAX_IMAGE_MB`` so operators can tune both bounds:

- ``API_SERVER_MAX_REQUEST_MB`` caps the whole request (default 25)
- ``API_SERVER_MAX_IMAGE_MB``   caps any single base64 image (default 50)

Memory-bomb protection is preserved: operators who want the tighter
15 MiB behavior can set ``API_SERVER_MAX_IMAGE_MB=15``.

Also rename the module constant from ``_MAX_IMAGE_BYTES`` (private) to
``MAX_IMAGE_BYTES`` to mirror ``MAX_REQUEST_BYTES`` and make the env
knob easier to introspect from tests.

Test additions parallel the ``_resolve_max_request_bytes`` suite:
default, env override (integer / fractional), empty/whitespace,
invalid, non-positive, module-constant sanity.
@sunxyless

Copy link
Copy Markdown
Author

Superseded by #12969 (merged) + #8328 (body-size follow-up, still open). Closing.

#12969's design — passing canonical multimodal structure straight through to the provider — is cleaner than my auxiliary-vision preprocessor approach, which would have been a pure fallback path for non-vision providers. Since Codex Responses + Anthropic both handle the canonical format natively via the converter in run_agent.py, there's no remaining use case for the preprocessor.

The parts of this PR that don't overlap with #12969 / #8328 are the per-image cap and the security-review hardening (SSRF filter, magic-byte validation, exception-message non-leakage), but those only made sense wrapped around the preprocessor. Happy to revive any of them as a separate, scoped PR if there's interest.

Thanks for the quick rewrite on #12969 and for the nod to the other PRs in the credits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants