Skip to content

feat(api_server): multimodal content support (images + audio)#4046

Closed
manuelschipper wants to merge 1 commit into
NousResearch:mainfrom
manuelschipper:feat/api-server-multimodal
Closed

feat(api_server): multimodal content support (images + audio)#4046
manuelschipper wants to merge 1 commit into
NousResearch:mainfrom
manuelschipper:feat/api-server-multimodal

Conversation

@manuelschipper

Copy link
Copy Markdown
Contributor

Summary

The API server's /v1/chat/completions endpoint now handles OpenAI multimodal content arrays (images and audio) instead of silently dropping non-text parts.

  • Body limit: MAX_REQUEST_BYTES raised from 1 MB to 50 MB (configurable via API_SERVER_MAX_BODY_MB env var) -- base64-encoded images exceed the old limit, causing 413 rejections
  • Images: image_url content parts are described via vision_analyze_tool and enriched as text -- same pattern as the Telegram gateway's _enrich_message_with_vision()
  • Audio: input_audio content parts are transcribed via transcribe_audio() (Whisper/Groq/OpenAI STT) -- same pattern as the Telegram gateway's _enrich_message_with_transcription()
  • Text: text/input_text parts pass through as-is; plain string content is unchanged (no regression)

This enables any OpenAI-compatible frontend (Open WebUI, oye, LibreChat, etc.) to send images and voice messages through the API server.

Test plan

  • Text-only messages work as before (regression)
  • Content array with text parts only -- extracted correctly
  • image_url with base64 data URI -- vision describes it, agent responds
  • input_audio with base64 webm/wav -- STT transcribes, agent responds to transcript
  • Large image (5 MB) doesn't hit body limit
  • Vision/STT failure doesn't crash the request (graceful fallback message)

The API server's /v1/chat/completions endpoint now handles OpenAI
multimodal content arrays instead of dropping non-text parts.

**Changes:**

- Raise MAX_REQUEST_BYTES from 1 MB to 50 MB (configurable via
  API_SERVER_MAX_BODY_MB env var) — base64-encoded images easily
  exceed the old limit, causing silent 413 rejections.

- Add _process_multimodal_content() that replicates the Telegram
  gateway's text-enrichment pattern:
  - image_url parts → described via vision_analyze_tool, with the
    local cache path included so the agent can re-examine if needed
  - input_audio parts → transcribed via transcribe_audio (same
    Whisper/Groq/OpenAI STT pipeline as Telegram voice messages)
  - text/input_text parts → passed through as-is

- Wire processing into _handle_chat_completions before user_message
  extraction, so the agent receives enriched plain text.

This enables any OpenAI-compatible frontend (Open WebUI, oye, etc.)
to send images and voice messages through the API server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@manuelschipper manuelschipper force-pushed the feat/api-server-multimodal branch from d6bf667 to 39b41f6 Compare March 31, 2026 06:31
manuelschipper pushed a commit to manuelschipper/hermes-agent that referenced this pull request Apr 20, 2026
…ltimodal, file attachments

Local monkey patch on top of upstream NousResearch/hermes-agent. Connects
Hermes' API server to Oye's hermes-aware SSE consumer. Four logically
distinct features bundled into one commit because they all touch
`gateway/platforms/api_server.py` and would conflict with each other on
cherry-pick.

This commit message is the canonical reference for re-applying the patch
after a future `hermes update` reset. Read it end-to-end before re-doing
the cherry-pick — the upstream-mirror PRs (NousResearch#4046, NousResearch#4265) are still OPEN
so we will keep maintaining this locally for a while.

================================================================
Feature 1 — Reasoning callback in SSE stream
================================================================

Goal: emit `delta.reasoning_content` chunks on the chat-completions SSE
stream so Oye renders the agent's thinking in a separate UI element.

Wiring:
  * Add `reasoning_callback=None` parameter to `_create_agent()` and
    `_run_agent()` (both signature lines and the inner agent constructor
    call). AIAgent (run_agent.py:521) accepts this parameter natively.
  * In `_handle_chat_completions`, allocate `_reasoning_q = _q.Queue()`.
  * Define `_on_reasoning(text)` that pushes onto `_reasoning_q`.
  * Pass `_on_reasoning` as `reasoning_callback=` into `_run_agent()`.
  * Pass `reasoning_q=_reasoning_q` into `_write_sse_chat_completion()`.
  * Add `reasoning_q=None` parameter to `_write_sse_chat_completion()`.
  * Inside `_write_sse_chat_completion`, define a nested
    `_drain_side_queues()` that drains `reasoning_q` and emits each text
    chunk as `data: {"choices":[{"delta":{"reasoning_content": text}}]}`.
  * Call `_drain_side_queues()` in the SSE main loop both before each
    poll and on final flush.

Upstream status: there is NO reasoning_callback support anywhere in
upstream `gateway/platforms/api_server.py`. PR NousResearch#4265 (open) covers this.
Without this patch, Oye sees zero reasoning content even though the
underlying AIAgent fires reasoning callbacks.

================================================================
Feature 2 — Tool progress callback as a separate SSE event channel
================================================================

Goal: emit `event: tool_progress` SSE custom events for each tool call so
Oye renders tool activity badges in a separate UI element (NOT inline
markdown in the assistant response).

Wiring (parallel to the reasoning wiring above):
  * Add `tool_progress_callback=None` parameter to `_create_agent()` and
    `_run_agent()` and pass it through to AIAgent.
  * Allocate `_progress_q = _q.Queue()` in `_handle_chat_completions`.
  * Define `_on_tool_progress(event, name=None, preview=None, args=None,
                              **kwargs)` — see "Callback signature" below.
  * Pass `_on_tool_progress` as `tool_progress_callback=` into
    `_run_agent()`.
  * Pass `progress_q=_progress_q` into `_write_sse_chat_completion()`.
  * Add `progress_q=None` parameter to `_write_sse_chat_completion()`.
  * Inside `_drain_side_queues()`, drain `progress_q` and emit each item
    as `event: tool_progress\ndata: {json}\n\n`.

Callback signature — IMPORTANT:
  AIAgent (since upstream commit cc2b56b) calls tool_progress_callback
  with a 4-arg signature plus optional kwargs:
    tool_progress_callback("tool.started", name, preview, args)
    tool_progress_callback("tool.completed", name, None, None,
                           duration=..., is_error=...)
    tool_progress_callback("_thinking", first_line)

  An older 3-arg signature `(name, preview, args)` will silently fail
  with TypeError that gets swallowed at run_agent.py:6207, producing
  ZERO tool_progress events on the wire. This is the bug we hit on
  2026-04-07 after upgrading to v0.7.0.

Event filtering — IMPORTANT:
  Oye renders ONE visual badge per emitted event (`appendThinkingTool`
  in oye/static/generation-store.js does not dedupe). To avoid
  duplicate-empty-badge noise, this callback applies these rules:

    if event == "_thinking":              return  # internal preview
    if name and name.startswith("_"):     return  # internal tool name
    if event == "tool.started":           emit {tool, preview}
    if event == "tool.completed" and is_error:
                                          emit {tool, preview="✗ failed (Xs)"}
    # tool.completed (success), unknown:  drop silently

  The `✗ failed (Xs)` preview uses the `duration` kwarg from AIAgent and
  is intentionally visually distinct from any started-event preview so
  Oye does not render it as another tool invocation.

Payload format consumed by Oye:
  Oye's parser (oye/sse.py + oye/cli_chat.py:_render_tool_progress and
  oye/static/generation-store.js:appendToolCall/appendThinkingTool)
  expects exactly: {"tool": str, "preview": str}.

Upstream status: PR NousResearch#4092 (`1e59d481`) added a DIFFERENT tool_progress
mechanism — it injects tool progress as inline markdown into the main
content stream via `_stream_q.put(f"`{emoji} {label}`")`. That mixes
tool activity into the assistant's response text and loses the
structured-channel UX Oye renders. We replace upstream's `_on_tool_progress`
on cherry-pick. Our SSE-channel approach is in PR NousResearch#4265 (open).

================================================================
Feature 3 — Multimodal content preprocessing
================================================================

Goal: accept large multimodal request bodies and preprocess images/audio
into text descriptions before the agent sees them.

Wiring:
  * Raise `MAX_REQUEST_BYTES` from 1 MB to 50 MB
    (configurable via `API_SERVER_MAX_BODY_MB` env var).
  * Add `_process_multimodal_content(self, user_message_content) -> str`
    method that:
      - Parses OpenAI content arrays (list of {type, text|image_url|...}).
      - Describes images via `vision_analyze_tool`.
      - Transcribes audio via `transcribe_audio`.
      - Returns enriched plain text.
    (Same pattern as the Telegram gateway adapter.)
  * Wire it into `_handle_chat_completions` BEFORE user_message
    extraction:
      `last["content"] = await self._process_multimodal_content(
                              last.get("content", ""))`

Upstream status: PR NousResearch#4046 (open). Upstream commit `71e81728` added a
DIFFERENT approach (Codex OAuth vision pass-through inside
`_CodexCompletionsAdapter`); that only handles images on the
`openai-codex` provider and does not cover audio transcription, so it
is not a replacement.

================================================================
Feature 4 — File attachment handling for Oye (mold-38)
================================================================

Goal: accept `{type: "file", file: {filename, file_data}}` content parts
(used by Oye for PDF/docx/xlsx/csv/etc. uploads), persist them to a
sandbox-visible cache, and tell the agent where to find them so it can
read them with its terminal toolchain.

Without this branch, the loop only handles text/input_text/image_url/
input_audio and silently drops file parts — the agent sees the user's
question with no document attached and acts as if nothing was sent.

Wiring:

* New imports: `base64`, `pathlib.Path`.

* New module-level constants (top of file, after MAX_REQUEST_BYTES):
    OYE_DOCUMENT_CACHE_DIR  = Path(\$HERMES_HOME) / 'oye_documents'
    OYE_SANDBOX_CACHE_PATH  = '/home/pn/.hermes/cache/oye-documents'
    OYE_DOCUMENT_MAX_AGE_SECONDS = 24 * 3600
    OYE_INLINE_MAX_BYTES    = 100 * 1024
    OYE_INLINE_EXTENSIONS   = {.md .txt .csv .tsv .json .yaml .yml .xml .html .htm}
    _OYE_SUPPORTED_DOCUMENT_TYPES = {21 entries: pdf, md, txt, csv, tsv,
        json, yaml, yml, xml, html, htm, rtf, zip, docx, xlsx, pptx, odt,
        epub, ipynb}

* New module-level helpers (mirroring gateway/platforms/base.py
  cache_document_from_bytes line for line, just pointed at a different
  cache dir):
    _cache_oye_document(data, filename) -> str
        - mkdir parents
        - sanitize filename (Path(name).name + strip control chars +
          fall back to 'document' for empty/./..)
        - prefix with doc_<uuid12>_ for collision safety
        - is_relative_to() path-traversal guard
        - write bytes, return absolute gateway-internal path
    _to_sandbox_oye_path(p) -> str
        - replace OYE_DOCUMENT_CACHE_DIR prefix with
          OYE_SANDBOX_CACHE_PATH
        - assert prefix matches before substitution; raise on mismatch
    _cleanup_oye_documents(max_age_seconds=OYE_DOCUMENT_MAX_AGE_SECONDS) -> int
        - walk OYE_DOCUMENT_CACHE_DIR, unlink files older than threshold
        - returns count removed; swallows OSError per file

* New \`elif ptype == \"file\":\` branch in
  _process_multimodal_content (joins a new file_descriptions list,
  inserted into the enriched output between audio_transcripts and
  text_parts so the agent reads orientation BEFORE the user question):

    1. Pull filename and file_data from part['file'].
    2. Strip data URL header, base64.b64decode the body. On decode
       failure, append loud error note and continue.
    3. Look extension up in _OYE_SUPPORTED_DOCUMENT_TYPES. If
       unsupported, append loud note and continue. (Slack/Discord skip
       silently — for the API-server path we are louder, since there is
       no other channel for the user to learn the file was dropped.)
    4. _cache_oye_document(raw, filename). On error, append loud cache
       note and continue.
    5. _cleanup_oye_documents() — best-effort 24h GC on every write to
       bound the cache without patching gateway/run.py's cron ticker.
    6. _to_sandbox_oye_path(cached_path).
    7. Append orientation note in the same shape as image/audio:
         '[The user attached <name> (<mime>, <kb> KB) at <sandbox path>
          — read it with the terminal tool when you need to.]'
    8. For OYE_INLINE_EXTENSIONS under OYE_INLINE_MAX_BYTES, also append
       '[Content of <name>]:\\n<text>' (mirrors slack.py:864-877 and
       discord.py:2366-2379 exactly). Skip on UnicodeDecodeError.

Why a separate oye_documents cache instead of reusing document_cache:

The upstream document_cache auto-mount in tools/credential_files.py:357
(get_cache_directory_mounts) computes host paths from inside the gateway
container. For any non-CreatBot bot, this produces the wrong host path
because the bot home is bind-mounted as /home/dev/.hermes inside the
gateway via the compose trick (e.g. /home/dev/.hermes-sunshine:
/home/dev/.hermes for Sunshine). The docker daemon then bind-mounts
/home/dev/.hermes/document_cache from the host — which is CreatBot's
parent, not Sunshine's. Image/audio paths have hidden the same bug
because vision/transcription run inside the gateway and never use the
sandbox mount; document handling is the first flow that exercises the
mount end-to-end.

Mold-38 sidesteps the bug by using a fully separate, explicitly-mounted
cache wired via each bot's terminal.docker_volumes:
  CreatBot: /home/dev/.hermes/oye_documents:/home/pn/.hermes/cache/oye-documents:rw
  Sunshine: /home/dev/.hermes-sunshine/oye_documents:/home/pn/.hermes/cache/oye-documents:rw

The destination /home/pn/.hermes/cache/oye-documents deliberately
differs from the auto-injected /root/.hermes/cache/documents (which is
both broken AND unreadable to the sandbox's pn user, since /root is mode
700). The auto-mount is NOT touched by this patch.

Follow-up fleet mold (NOT in mold-38) should:
- Introduce HERMES_HOST_HOME env var per bot in each compose file.
- Patch get_cache_directory_mounts to substitute HERMES_HOME ->
  HERMES_HOST_HOME when computing host paths.
- Migrate Oye from oye_documents back onto the shared cache/documents
  and collapse _cache_oye_document into the upstream helper.

Upstream status: nothing equivalent in api_server.py on origin/main.
The OpenAI \`type: file\` content shape is supported by the upstream
Chat Completions API spec but no upstream gateway processes it. Worth
opening a small PR to upstream the type-set + branch (without the
oye_documents sidestep — that part is fleet-specific).

================================================================
Re-applying after a hermes upgrade
================================================================

When \`hermes update\` (or a manual git pull) brings in new upstream
commits, this patch needs to be re-applied. Recommended procedure:

  1. Save the current monkey-patched file as a reference:
       cp gateway/platforms/api_server.py /tmp/api_server.MONKEYPATCHED

  2. Update main:
       git checkout main
       git pull --ff-only origin main   # or reset --hard if diverged

  3. Try cherry-pick first (will likely conflict on the file above):
       git cherry-pick <previous-monkey-patch-sha>

  4. For each conflict region, the rule is:
       - Take upstream's NEW additions (session_db, fallback_model,
         session_id parameters added since the last patch).
       - Keep our additions (reasoning_callback, _progress_q,
         _reasoning_q, _drain_side_queues, _process_multimodal_content,
         MAX_REQUEST_BYTES bump, OYE_DOCUMENT_CACHE_DIR + helpers,
         the file branch).
       - Replace upstream's \`_on_tool_progress(name, preview, args)\`
         (the inline-markdown one from PR NousResearch#4092) with our queue-based
         version that matches the AIAgent 4-arg signature above.

  5. Verify all features after rebuild:
       a. Hermes syntax check:
            python3 -c \"import ast; ast.parse(open(
              'gateway/platforms/api_server.py').read())\"
       b. Reinstall venv deps:
            uv pip install -e \".[all]\"
       c. Clear bytecode:
            find . -type d -name __pycache__ -exec rm -rf {} +
       d. Restart bot with the 75s telegram-polling restart gap
          (see deploy-hermes skill — \`down\`, sleep 75s, \`up -d\`).
       e. Test reasoning + tool_progress + file attachments end-to-end
          via Oye web upload.

  6. If cherry-pick is too conflict-prone (>5 hunks), fall back to:
       diff /tmp/api_server.MONKEYPATCHED gateway/platforms/api_server.py
     and re-apply additions manually using the feature descriptions in
     this commit message as your contract.

================================================================
Files touched
================================================================

  gateway/platforms/api_server.py    # all of the above

Nothing else. The patch deliberately stays in one file so the bridge
layer stays self-contained and easy to spot in \`git log\`.

================================================================
Related upstream PRs
================================================================

  NousResearch#4046 — multimodal content support (still OPEN)
  NousResearch#4265 — tool_progress + reasoning SSE wiring (still OPEN)

When/if either merges, drop the corresponding feature from this commit.
File attachment handling (Feature 4) has no upstream PR yet.
teknium1 added a commit that referenced this pull request Apr 20, 2026
…/responses

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes #5621, #8253, #4046, #6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
teknium1 added a commit that referenced this pull request Apr 20, 2026
…/responses (#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes #5621, #8253, #4046, #6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
@teknium1

Copy link
Copy Markdown
Contributor

Closed in favor of #12969 (merged as f683132). Multimodal image inputs are now supported on both /v1/chat/completions and /v1/responses. Audio parts remain out of scope for a future PR. Credited in the merged PR body.

@teknium1 teknium1 closed this Apr 20, 2026
@manuelschipper

Copy link
Copy Markdown
Contributor Author

Thanks @teknium1 for closing this out via #12969 and for calling out audio as future work. I split the remaining audio piece into a narrower follow-up PR here: #13184.

That PR only adds input_audio support on the final user message for /v1/chat/completions, reuses Hermes’ existing STT path, keeps /v1/responses out of scope, and leaves the merged image-input behavior untouched. Focused tests and docs are included.

ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…/responses (NousResearch#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…/responses (NousResearch#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
Luminet2023 pushed a commit to Luminet2023/hermes-agent that referenced this pull request May 1, 2026
…/responses (NousResearch#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…/responses (NousResearch#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…/responses (NousResearch#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…/responses (NousResearch#12969)

OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:

  Chat Completions: {type: text|image_url, image_url: {url, detail?}}
  Responses:        {type: input_text|input_image, image_url: <str>, detail?}

The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.

Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.

Changes:

- gateway/platforms/api_server.py
  - _normalize_multimodal_content(): validates + normalizes both Chat and
    Responses content shapes. Returns a plain string for text-only content
    (preserves prompt-cache behavior on existing callers) or a canonical
    [{type:text|image_url,...}] list when images are present.
  - _content_has_visible_payload(): replaces the bare truthy check so a
    user turn with only an image no longer rejects as 'No user message'.
  - _handle_chat_completions and _handle_responses both call the new helper
    for user/assistant content; system messages continue to flatten to text.
  - Codex conversation_history, input[], and inline history paths all share
    the same validator. No duplicated normalizers.

- run_agent.py
  - _summarize_user_message_for_log(): produces a short string summary
    ('[1 image] describe this') from list content for logging, spinner
    previews, and trajectory writes. Fixes AttributeError when list
    user_message hit user_message[:80] + '...' / .replace().
  - _chat_content_to_responses_parts(): module-level helper that converts
    chat-style multimodal content to Responses 'input_text'/'input_image'
    parts. Used in _chat_messages_to_responses_input for Codex routing.
  - _preflight_codex_input_items() now validates and passes through list
    content parts for user/assistant messages instead of stringifying.

- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
  - Unit coverage for _normalize_multimodal_content, including both part
    formats, data URL gating, and all reject paths.
  - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
    verifying multimodal payloads reach _run_agent intact.
  - 400 coverage for file / input_file / non-image data URL.

- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
  - Regression coverage for the prologue no-crash contract.
  - _chat_content_to_responses_parts round-trip coverage.

- website/docs/user-guide/features/api-server.md
  - Inline image examples for both endpoints.
  - Updated Limitations: files still unsupported, images now supported.

Validated live against openrouter/anthropic/claude-opus-4.6:
  POST /v1/chat/completions  → 200, vision-accurate description
  POST /v1/responses         → 200, same image, clean output_text
  POST /v1/chat/completions [file] → 400 unsupported_content_type
  POST /v1/responses [input_file]  → 400 unsupported_content_type
  POST /v1/responses [non-image data URL] → 400 unsupported_content_type

Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632.

Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants