feat(api_server): multimodal content support (images + audio)#4046
Closed
manuelschipper wants to merge 1 commit into
Closed
feat(api_server): multimodal content support (images + audio)#4046manuelschipper wants to merge 1 commit into
manuelschipper wants to merge 1 commit into
Conversation
The API server's /v1/chat/completions endpoint now handles OpenAI
multimodal content arrays instead of dropping non-text parts.
**Changes:**
- Raise MAX_REQUEST_BYTES from 1 MB to 50 MB (configurable via
API_SERVER_MAX_BODY_MB env var) — base64-encoded images easily
exceed the old limit, causing silent 413 rejections.
- Add _process_multimodal_content() that replicates the Telegram
gateway's text-enrichment pattern:
- image_url parts → described via vision_analyze_tool, with the
local cache path included so the agent can re-examine if needed
- input_audio parts → transcribed via transcribe_audio (same
Whisper/Groq/OpenAI STT pipeline as Telegram voice messages)
- text/input_text parts → passed through as-is
- Wire processing into _handle_chat_completions before user_message
extraction, so the agent receives enriched plain text.
This enables any OpenAI-compatible frontend (Open WebUI, oye, etc.)
to send images and voice messages through the API server.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d6bf667 to
39b41f6
Compare
19 tasks
manuelschipper
pushed a commit
to manuelschipper/hermes-agent
that referenced
this pull request
Apr 20, 2026
…ltimodal, file attachments Local monkey patch on top of upstream NousResearch/hermes-agent. Connects Hermes' API server to Oye's hermes-aware SSE consumer. Four logically distinct features bundled into one commit because they all touch `gateway/platforms/api_server.py` and would conflict with each other on cherry-pick. This commit message is the canonical reference for re-applying the patch after a future `hermes update` reset. Read it end-to-end before re-doing the cherry-pick — the upstream-mirror PRs (NousResearch#4046, NousResearch#4265) are still OPEN so we will keep maintaining this locally for a while. ================================================================ Feature 1 — Reasoning callback in SSE stream ================================================================ Goal: emit `delta.reasoning_content` chunks on the chat-completions SSE stream so Oye renders the agent's thinking in a separate UI element. Wiring: * Add `reasoning_callback=None` parameter to `_create_agent()` and `_run_agent()` (both signature lines and the inner agent constructor call). AIAgent (run_agent.py:521) accepts this parameter natively. * In `_handle_chat_completions`, allocate `_reasoning_q = _q.Queue()`. * Define `_on_reasoning(text)` that pushes onto `_reasoning_q`. * Pass `_on_reasoning` as `reasoning_callback=` into `_run_agent()`. * Pass `reasoning_q=_reasoning_q` into `_write_sse_chat_completion()`. * Add `reasoning_q=None` parameter to `_write_sse_chat_completion()`. * Inside `_write_sse_chat_completion`, define a nested `_drain_side_queues()` that drains `reasoning_q` and emits each text chunk as `data: {"choices":[{"delta":{"reasoning_content": text}}]}`. * Call `_drain_side_queues()` in the SSE main loop both before each poll and on final flush. Upstream status: there is NO reasoning_callback support anywhere in upstream `gateway/platforms/api_server.py`. PR NousResearch#4265 (open) covers this. Without this patch, Oye sees zero reasoning content even though the underlying AIAgent fires reasoning callbacks. ================================================================ Feature 2 — Tool progress callback as a separate SSE event channel ================================================================ Goal: emit `event: tool_progress` SSE custom events for each tool call so Oye renders tool activity badges in a separate UI element (NOT inline markdown in the assistant response). Wiring (parallel to the reasoning wiring above): * Add `tool_progress_callback=None` parameter to `_create_agent()` and `_run_agent()` and pass it through to AIAgent. * Allocate `_progress_q = _q.Queue()` in `_handle_chat_completions`. * Define `_on_tool_progress(event, name=None, preview=None, args=None, **kwargs)` — see "Callback signature" below. * Pass `_on_tool_progress` as `tool_progress_callback=` into `_run_agent()`. * Pass `progress_q=_progress_q` into `_write_sse_chat_completion()`. * Add `progress_q=None` parameter to `_write_sse_chat_completion()`. * Inside `_drain_side_queues()`, drain `progress_q` and emit each item as `event: tool_progress\ndata: {json}\n\n`. Callback signature — IMPORTANT: AIAgent (since upstream commit cc2b56b) calls tool_progress_callback with a 4-arg signature plus optional kwargs: tool_progress_callback("tool.started", name, preview, args) tool_progress_callback("tool.completed", name, None, None, duration=..., is_error=...) tool_progress_callback("_thinking", first_line) An older 3-arg signature `(name, preview, args)` will silently fail with TypeError that gets swallowed at run_agent.py:6207, producing ZERO tool_progress events on the wire. This is the bug we hit on 2026-04-07 after upgrading to v0.7.0. Event filtering — IMPORTANT: Oye renders ONE visual badge per emitted event (`appendThinkingTool` in oye/static/generation-store.js does not dedupe). To avoid duplicate-empty-badge noise, this callback applies these rules: if event == "_thinking": return # internal preview if name and name.startswith("_"): return # internal tool name if event == "tool.started": emit {tool, preview} if event == "tool.completed" and is_error: emit {tool, preview="✗ failed (Xs)"} # tool.completed (success), unknown: drop silently The `✗ failed (Xs)` preview uses the `duration` kwarg from AIAgent and is intentionally visually distinct from any started-event preview so Oye does not render it as another tool invocation. Payload format consumed by Oye: Oye's parser (oye/sse.py + oye/cli_chat.py:_render_tool_progress and oye/static/generation-store.js:appendToolCall/appendThinkingTool) expects exactly: {"tool": str, "preview": str}. Upstream status: PR NousResearch#4092 (`1e59d481`) added a DIFFERENT tool_progress mechanism — it injects tool progress as inline markdown into the main content stream via `_stream_q.put(f"`{emoji} {label}`")`. That mixes tool activity into the assistant's response text and loses the structured-channel UX Oye renders. We replace upstream's `_on_tool_progress` on cherry-pick. Our SSE-channel approach is in PR NousResearch#4265 (open). ================================================================ Feature 3 — Multimodal content preprocessing ================================================================ Goal: accept large multimodal request bodies and preprocess images/audio into text descriptions before the agent sees them. Wiring: * Raise `MAX_REQUEST_BYTES` from 1 MB to 50 MB (configurable via `API_SERVER_MAX_BODY_MB` env var). * Add `_process_multimodal_content(self, user_message_content) -> str` method that: - Parses OpenAI content arrays (list of {type, text|image_url|...}). - Describes images via `vision_analyze_tool`. - Transcribes audio via `transcribe_audio`. - Returns enriched plain text. (Same pattern as the Telegram gateway adapter.) * Wire it into `_handle_chat_completions` BEFORE user_message extraction: `last["content"] = await self._process_multimodal_content( last.get("content", ""))` Upstream status: PR NousResearch#4046 (open). Upstream commit `71e81728` added a DIFFERENT approach (Codex OAuth vision pass-through inside `_CodexCompletionsAdapter`); that only handles images on the `openai-codex` provider and does not cover audio transcription, so it is not a replacement. ================================================================ Feature 4 — File attachment handling for Oye (mold-38) ================================================================ Goal: accept `{type: "file", file: {filename, file_data}}` content parts (used by Oye for PDF/docx/xlsx/csv/etc. uploads), persist them to a sandbox-visible cache, and tell the agent where to find them so it can read them with its terminal toolchain. Without this branch, the loop only handles text/input_text/image_url/ input_audio and silently drops file parts — the agent sees the user's question with no document attached and acts as if nothing was sent. Wiring: * New imports: `base64`, `pathlib.Path`. * New module-level constants (top of file, after MAX_REQUEST_BYTES): OYE_DOCUMENT_CACHE_DIR = Path(\$HERMES_HOME) / 'oye_documents' OYE_SANDBOX_CACHE_PATH = '/home/pn/.hermes/cache/oye-documents' OYE_DOCUMENT_MAX_AGE_SECONDS = 24 * 3600 OYE_INLINE_MAX_BYTES = 100 * 1024 OYE_INLINE_EXTENSIONS = {.md .txt .csv .tsv .json .yaml .yml .xml .html .htm} _OYE_SUPPORTED_DOCUMENT_TYPES = {21 entries: pdf, md, txt, csv, tsv, json, yaml, yml, xml, html, htm, rtf, zip, docx, xlsx, pptx, odt, epub, ipynb} * New module-level helpers (mirroring gateway/platforms/base.py cache_document_from_bytes line for line, just pointed at a different cache dir): _cache_oye_document(data, filename) -> str - mkdir parents - sanitize filename (Path(name).name + strip control chars + fall back to 'document' for empty/./..) - prefix with doc_<uuid12>_ for collision safety - is_relative_to() path-traversal guard - write bytes, return absolute gateway-internal path _to_sandbox_oye_path(p) -> str - replace OYE_DOCUMENT_CACHE_DIR prefix with OYE_SANDBOX_CACHE_PATH - assert prefix matches before substitution; raise on mismatch _cleanup_oye_documents(max_age_seconds=OYE_DOCUMENT_MAX_AGE_SECONDS) -> int - walk OYE_DOCUMENT_CACHE_DIR, unlink files older than threshold - returns count removed; swallows OSError per file * New \`elif ptype == \"file\":\` branch in _process_multimodal_content (joins a new file_descriptions list, inserted into the enriched output between audio_transcripts and text_parts so the agent reads orientation BEFORE the user question): 1. Pull filename and file_data from part['file']. 2. Strip data URL header, base64.b64decode the body. On decode failure, append loud error note and continue. 3. Look extension up in _OYE_SUPPORTED_DOCUMENT_TYPES. If unsupported, append loud note and continue. (Slack/Discord skip silently — for the API-server path we are louder, since there is no other channel for the user to learn the file was dropped.) 4. _cache_oye_document(raw, filename). On error, append loud cache note and continue. 5. _cleanup_oye_documents() — best-effort 24h GC on every write to bound the cache without patching gateway/run.py's cron ticker. 6. _to_sandbox_oye_path(cached_path). 7. Append orientation note in the same shape as image/audio: '[The user attached <name> (<mime>, <kb> KB) at <sandbox path> — read it with the terminal tool when you need to.]' 8. For OYE_INLINE_EXTENSIONS under OYE_INLINE_MAX_BYTES, also append '[Content of <name>]:\\n<text>' (mirrors slack.py:864-877 and discord.py:2366-2379 exactly). Skip on UnicodeDecodeError. Why a separate oye_documents cache instead of reusing document_cache: The upstream document_cache auto-mount in tools/credential_files.py:357 (get_cache_directory_mounts) computes host paths from inside the gateway container. For any non-CreatBot bot, this produces the wrong host path because the bot home is bind-mounted as /home/dev/.hermes inside the gateway via the compose trick (e.g. /home/dev/.hermes-sunshine: /home/dev/.hermes for Sunshine). The docker daemon then bind-mounts /home/dev/.hermes/document_cache from the host — which is CreatBot's parent, not Sunshine's. Image/audio paths have hidden the same bug because vision/transcription run inside the gateway and never use the sandbox mount; document handling is the first flow that exercises the mount end-to-end. Mold-38 sidesteps the bug by using a fully separate, explicitly-mounted cache wired via each bot's terminal.docker_volumes: CreatBot: /home/dev/.hermes/oye_documents:/home/pn/.hermes/cache/oye-documents:rw Sunshine: /home/dev/.hermes-sunshine/oye_documents:/home/pn/.hermes/cache/oye-documents:rw The destination /home/pn/.hermes/cache/oye-documents deliberately differs from the auto-injected /root/.hermes/cache/documents (which is both broken AND unreadable to the sandbox's pn user, since /root is mode 700). The auto-mount is NOT touched by this patch. Follow-up fleet mold (NOT in mold-38) should: - Introduce HERMES_HOST_HOME env var per bot in each compose file. - Patch get_cache_directory_mounts to substitute HERMES_HOME -> HERMES_HOST_HOME when computing host paths. - Migrate Oye from oye_documents back onto the shared cache/documents and collapse _cache_oye_document into the upstream helper. Upstream status: nothing equivalent in api_server.py on origin/main. The OpenAI \`type: file\` content shape is supported by the upstream Chat Completions API spec but no upstream gateway processes it. Worth opening a small PR to upstream the type-set + branch (without the oye_documents sidestep — that part is fleet-specific). ================================================================ Re-applying after a hermes upgrade ================================================================ When \`hermes update\` (or a manual git pull) brings in new upstream commits, this patch needs to be re-applied. Recommended procedure: 1. Save the current monkey-patched file as a reference: cp gateway/platforms/api_server.py /tmp/api_server.MONKEYPATCHED 2. Update main: git checkout main git pull --ff-only origin main # or reset --hard if diverged 3. Try cherry-pick first (will likely conflict on the file above): git cherry-pick <previous-monkey-patch-sha> 4. For each conflict region, the rule is: - Take upstream's NEW additions (session_db, fallback_model, session_id parameters added since the last patch). - Keep our additions (reasoning_callback, _progress_q, _reasoning_q, _drain_side_queues, _process_multimodal_content, MAX_REQUEST_BYTES bump, OYE_DOCUMENT_CACHE_DIR + helpers, the file branch). - Replace upstream's \`_on_tool_progress(name, preview, args)\` (the inline-markdown one from PR NousResearch#4092) with our queue-based version that matches the AIAgent 4-arg signature above. 5. Verify all features after rebuild: a. Hermes syntax check: python3 -c \"import ast; ast.parse(open( 'gateway/platforms/api_server.py').read())\" b. Reinstall venv deps: uv pip install -e \".[all]\" c. Clear bytecode: find . -type d -name __pycache__ -exec rm -rf {} + d. Restart bot with the 75s telegram-polling restart gap (see deploy-hermes skill — \`down\`, sleep 75s, \`up -d\`). e. Test reasoning + tool_progress + file attachments end-to-end via Oye web upload. 6. If cherry-pick is too conflict-prone (>5 hunks), fall back to: diff /tmp/api_server.MONKEYPATCHED gateway/platforms/api_server.py and re-apply additions manually using the feature descriptions in this commit message as your contract. ================================================================ Files touched ================================================================ gateway/platforms/api_server.py # all of the above Nothing else. The patch deliberately stays in one file so the bridge layer stays self-contained and easy to spot in \`git log\`. ================================================================ Related upstream PRs ================================================================ NousResearch#4046 — multimodal content support (still OPEN) NousResearch#4265 — tool_progress + reasoning SSE wiring (still OPEN) When/if either merges, drop the corresponding feature from this commit. File attachment handling (Feature 4) has no upstream PR yet.
teknium1
added a commit
that referenced
this pull request
Apr 20, 2026
…/responses
OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:
Chat Completions: {type: text|image_url, image_url: {url, detail?}}
Responses: {type: input_text|input_image, image_url: <str>, detail?}
The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.
Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.
Changes:
- gateway/platforms/api_server.py
- _normalize_multimodal_content(): validates + normalizes both Chat and
Responses content shapes. Returns a plain string for text-only content
(preserves prompt-cache behavior on existing callers) or a canonical
[{type:text|image_url,...}] list when images are present.
- _content_has_visible_payload(): replaces the bare truthy check so a
user turn with only an image no longer rejects as 'No user message'.
- _handle_chat_completions and _handle_responses both call the new helper
for user/assistant content; system messages continue to flatten to text.
- Codex conversation_history, input[], and inline history paths all share
the same validator. No duplicated normalizers.
- run_agent.py
- _summarize_user_message_for_log(): produces a short string summary
('[1 image] describe this') from list content for logging, spinner
previews, and trajectory writes. Fixes AttributeError when list
user_message hit user_message[:80] + '...' / .replace().
- _chat_content_to_responses_parts(): module-level helper that converts
chat-style multimodal content to Responses 'input_text'/'input_image'
parts. Used in _chat_messages_to_responses_input for Codex routing.
- _preflight_codex_input_items() now validates and passes through list
content parts for user/assistant messages instead of stringifying.
- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
- Unit coverage for _normalize_multimodal_content, including both part
formats, data URL gating, and all reject paths.
- Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
verifying multimodal payloads reach _run_agent intact.
- 400 coverage for file / input_file / non-image data URL.
- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
- Regression coverage for the prologue no-crash contract.
- _chat_content_to_responses_parts round-trip coverage.
- website/docs/user-guide/features/api-server.md
- Inline image examples for both endpoints.
- Updated Limitations: files still unsupported, images now supported.
Validated live against openrouter/anthropic/claude-opus-4.6:
POST /v1/chat/completions → 200, vision-accurate description
POST /v1/responses → 200, same image, clean output_text
POST /v1/chat/completions [file] → 400 unsupported_content_type
POST /v1/responses [input_file] → 400 unsupported_content_type
POST /v1/responses [non-image data URL] → 400 unsupported_content_type
Closes #5621, #8253, #4046, #6632.
Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
teknium1
added a commit
that referenced
this pull request
Apr 20, 2026
…/responses (#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes #5621, #8253, #4046, #6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
Contributor
Contributor
Author
|
Thanks @teknium1 for closing this out via #12969 and for calling out audio as future work. I split the remaining audio piece into a narrower follow-up PR here: #13184. That PR only adds |
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
Luminet2023
pushed a commit
to Luminet2023/hermes-agent
that referenced
this pull request
May 1, 2026
…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The API server's
/v1/chat/completionsendpoint now handles OpenAI multimodal content arrays (images and audio) instead of silently dropping non-text parts.MAX_REQUEST_BYTESraised from 1 MB to 50 MB (configurable viaAPI_SERVER_MAX_BODY_MBenv var) -- base64-encoded images exceed the old limit, causing 413 rejectionsimage_urlcontent parts are described viavision_analyze_tooland enriched as text -- same pattern as the Telegram gateway's_enrich_message_with_vision()input_audiocontent parts are transcribed viatranscribe_audio()(Whisper/Groq/OpenAI STT) -- same pattern as the Telegram gateway's_enrich_message_with_transcription()text/input_textparts pass through as-is; plain string content is unchanged (no regression)This enables any OpenAI-compatible frontend (Open WebUI, oye, LibreChat, etc.) to send images and voice messages through the API server.
Test plan
textparts only -- extracted correctlyimage_urlwith base64 data URI -- vision describes it, agent respondsinput_audiowith base64 webm/wav -- STT transcribes, agent responds to transcript