feat(api_server): multimodal content support (images + audio) by manuelschipper · Pull Request #4046 · NousResearch/hermes-agent

manuelschipper · 2026-03-30T21:38:41Z

Summary

The API server's /v1/chat/completions endpoint now handles OpenAI multimodal content arrays (images and audio) instead of silently dropping non-text parts.

Body limit: MAX_REQUEST_BYTES raised from 1 MB to 50 MB (configurable via API_SERVER_MAX_BODY_MB env var) -- base64-encoded images exceed the old limit, causing 413 rejections
Images: image_url content parts are described via vision_analyze_tool and enriched as text -- same pattern as the Telegram gateway's _enrich_message_with_vision()
Audio: input_audio content parts are transcribed via transcribe_audio() (Whisper/Groq/OpenAI STT) -- same pattern as the Telegram gateway's _enrich_message_with_transcription()
Text: text/input_text parts pass through as-is; plain string content is unchanged (no regression)

This enables any OpenAI-compatible frontend (Open WebUI, oye, LibreChat, etc.) to send images and voice messages through the API server.

Test plan

Text-only messages work as before (regression)
Content array with text parts only -- extracted correctly
image_url with base64 data URI -- vision describes it, agent responds
input_audio with base64 webm/wav -- STT transcribes, agent responds to transcript
Large image (5 MB) doesn't hit body limit
Vision/STT failure doesn't crash the request (graceful fallback message)

The API server's /v1/chat/completions endpoint now handles OpenAI multimodal content arrays instead of dropping non-text parts. **Changes:** - Raise MAX_REQUEST_BYTES from 1 MB to 50 MB (configurable via API_SERVER_MAX_BODY_MB env var) — base64-encoded images easily exceed the old limit, causing silent 413 rejections. - Add _process_multimodal_content() that replicates the Telegram gateway's text-enrichment pattern: - image_url parts → described via vision_analyze_tool, with the local cache path included so the agent can re-examine if needed - input_audio parts → transcribed via transcribe_audio (same Whisper/Groq/OpenAI STT pipeline as Telegram voice messages) - text/input_text parts → passed through as-is - Wire processing into _handle_chat_completions before user_message extraction, so the agent receives enriched plain text. This enables any OpenAI-compatible frontend (Open WebUI, oye, etc.) to send images and voice messages through the API server. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ltimodal, file attachments Local monkey patch on top of upstream NousResearch/hermes-agent. Connects Hermes' API server to Oye's hermes-aware SSE consumer. Four logically distinct features bundled into one commit because they all touch `gateway/platforms/api_server.py` and would conflict with each other on cherry-pick. This commit message is the canonical reference for re-applying the patch after a future `hermes update` reset. Read it end-to-end before re-doing the cherry-pick — the upstream-mirror PRs (NousResearch#4046, NousResearch#4265) are still OPEN so we will keep maintaining this locally for a while. ================================================================ Feature 1 — Reasoning callback in SSE stream ================================================================ Goal: emit `delta.reasoning_content` chunks on the chat-completions SSE stream so Oye renders the agent's thinking in a separate UI element. Wiring: * Add `reasoning_callback=None` parameter to `_create_agent()` and `_run_agent()` (both signature lines and the inner agent constructor call). AIAgent (run_agent.py:521) accepts this parameter natively. * In `_handle_chat_completions`, allocate `_reasoning_q = _q.Queue()`. * Define `_on_reasoning(text)` that pushes onto `_reasoning_q`. * Pass `_on_reasoning` as `reasoning_callback=` into `_run_agent()`. * Pass `reasoning_q=_reasoning_q` into `_write_sse_chat_completion()`. * Add `reasoning_q=None` parameter to `_write_sse_chat_completion()`. * Inside `_write_sse_chat_completion`, define a nested `_drain_side_queues()` that drains `reasoning_q` and emits each text chunk as `data: {"choices":[{"delta":{"reasoning_content": text}}]}`. * Call `_drain_side_queues()` in the SSE main loop both before each poll and on final flush. Upstream status: there is NO reasoning_callback support anywhere in upstream `gateway/platforms/api_server.py`. PR NousResearch#4265 (open) covers this. Without this patch, Oye sees zero reasoning content even though the underlying AIAgent fires reasoning callbacks. ================================================================ Feature 2 — Tool progress callback as a separate SSE event channel ================================================================ Goal: emit `event: tool_progress` SSE custom events for each tool call so Oye renders tool activity badges in a separate UI element (NOT inline markdown in the assistant response). Wiring (parallel to the reasoning wiring above): * Add `tool_progress_callback=None` parameter to `_create_agent()` and `_run_agent()` and pass it through to AIAgent. * Allocate `_progress_q = _q.Queue()` in `_handle_chat_completions`. * Define `_on_tool_progress(event, name=None, preview=None, args=None, **kwargs)` — see "Callback signature" below. * Pass `_on_tool_progress` as `tool_progress_callback=` into `_run_agent()`. * Pass `progress_q=_progress_q` into `_write_sse_chat_completion()`. * Add `progress_q=None` parameter to `_write_sse_chat_completion()`. * Inside `_drain_side_queues()`, drain `progress_q` and emit each item as `event: tool_progress\ndata: {json}\n\n`. Callback signature — IMPORTANT: AIAgent (since upstream commit cc2b56b) calls tool_progress_callback with a 4-arg signature plus optional kwargs: tool_progress_callback("tool.started", name, preview, args) tool_progress_callback("tool.completed", name, None, None, duration=..., is_error=...) tool_progress_callback("_thinking", first_line) An older 3-arg signature `(name, preview, args)` will silently fail with TypeError that gets swallowed at run_agent.py:6207, producing ZERO tool_progress events on the wire. This is the bug we hit on 2026-04-07 after upgrading to v0.7.0. Event filtering — IMPORTANT: Oye renders ONE visual badge per emitted event (`appendThinkingTool` in oye/static/generation-store.js does not dedupe). To avoid duplicate-empty-badge noise, this callback applies these rules: if event == "_thinking": return # internal preview if name and name.startswith("_"): return # internal tool name if event == "tool.started": emit {tool, preview} if event == "tool.completed" and is_error: emit {tool, preview="✗ failed (Xs)"} # tool.completed (success), unknown: drop silently The `✗ failed (Xs)` preview uses the `duration` kwarg from AIAgent and is intentionally visually distinct from any started-event preview so Oye does not render it as another tool invocation. Payload format consumed by Oye: Oye's parser (oye/sse.py + oye/cli_chat.py:_render_tool_progress and oye/static/generation-store.js:appendToolCall/appendThinkingTool) expects exactly: {"tool": str, "preview": str}. Upstream status: PR NousResearch#4092 (`1e59d481`) added a DIFFERENT tool_progress mechanism — it injects tool progress as inline markdown into the main content stream via `_stream_q.put(f"`{emoji} {label}`")`. That mixes tool activity into the assistant's response text and loses the structured-channel UX Oye renders. We replace upstream's `_on_tool_progress` on cherry-pick. Our SSE-channel approach is in PR NousResearch#4265 (open). ================================================================ Feature 3 — Multimodal content preprocessing ================================================================ Goal: accept large multimodal request bodies and preprocess images/audio into text descriptions before the agent sees them. Wiring: * Raise `MAX_REQUEST_BYTES` from 1 MB to 50 MB (configurable via `API_SERVER_MAX_BODY_MB` env var). * Add `_process_multimodal_content(self, user_message_content) -> str` method that: - Parses OpenAI content arrays (list of {type, text|image_url|...}). - Describes images via `vision_analyze_tool`. - Transcribes audio via `transcribe_audio`. - Returns enriched plain text. (Same pattern as the Telegram gateway adapter.) * Wire it into `_handle_chat_completions` BEFORE user_message extraction: `last["content"] = await self._process_multimodal_content( last.get("content", ""))` Upstream status: PR NousResearch#4046 (open). Upstream commit `71e81728` added a DIFFERENT approach (Codex OAuth vision pass-through inside `_CodexCompletionsAdapter`); that only handles images on the `openai-codex` provider and does not cover audio transcription, so it is not a replacement. ================================================================ Feature 4 — File attachment handling for Oye (mold-38) ================================================================ Goal: accept `{type: "file", file: {filename, file_data}}` content parts (used by Oye for PDF/docx/xlsx/csv/etc. uploads), persist them to a sandbox-visible cache, and tell the agent where to find them so it can read them with its terminal toolchain. Without this branch, the loop only handles text/input_text/image_url/ input_audio and silently drops file parts — the agent sees the user's question with no document attached and acts as if nothing was sent. Wiring: * New imports: `base64`, `pathlib.Path`. * New module-level constants (top of file, after MAX_REQUEST_BYTES): OYE_DOCUMENT_CACHE_DIR = Path(\$HERMES_HOME) / 'oye_documents' OYE_SANDBOX_CACHE_PATH = '/home/pn/.hermes/cache/oye-documents' OYE_DOCUMENT_MAX_AGE_SECONDS = 24 * 3600 OYE_INLINE_MAX_BYTES = 100 * 1024 OYE_INLINE_EXTENSIONS = {.md .txt .csv .tsv .json .yaml .yml .xml .html .htm} _OYE_SUPPORTED_DOCUMENT_TYPES = {21 entries: pdf, md, txt, csv, tsv, json, yaml, yml, xml, html, htm, rtf, zip, docx, xlsx, pptx, odt, epub, ipynb} * New module-level helpers (mirroring gateway/platforms/base.py cache_document_from_bytes line for line, just pointed at a different cache dir): _cache_oye_document(data, filename) -> str - mkdir parents - sanitize filename (Path(name).name + strip control chars + fall back to 'document' for empty/./..) - prefix with doc_<uuid12>_ for collision safety - is_relative_to() path-traversal guard - write bytes, return absolute gateway-internal path _to_sandbox_oye_path(p) -> str - replace OYE_DOCUMENT_CACHE_DIR prefix with OYE_SANDBOX_CACHE_PATH - assert prefix matches before substitution; raise on mismatch _cleanup_oye_documents(max_age_seconds=OYE_DOCUMENT_MAX_AGE_SECONDS) -> int - walk OYE_DOCUMENT_CACHE_DIR, unlink files older than threshold - returns count removed; swallows OSError per file * New \`elif ptype == \"file\":\` branch in _process_multimodal_content (joins a new file_descriptions list, inserted into the enriched output between audio_transcripts and text_parts so the agent reads orientation BEFORE the user question): 1. Pull filename and file_data from part['file']. 2. Strip data URL header, base64.b64decode the body. On decode failure, append loud error note and continue. 3. Look extension up in _OYE_SUPPORTED_DOCUMENT_TYPES. If unsupported, append loud note and continue. (Slack/Discord skip silently — for the API-server path we are louder, since there is no other channel for the user to learn the file was dropped.) 4. _cache_oye_document(raw, filename). On error, append loud cache note and continue. 5. _cleanup_oye_documents() — best-effort 24h GC on every write to bound the cache without patching gateway/run.py's cron ticker. 6. _to_sandbox_oye_path(cached_path). 7. Append orientation note in the same shape as image/audio: '[The user attached <name> (<mime>, <kb> KB) at <sandbox path> — read it with the terminal tool when you need to.]' 8. For OYE_INLINE_EXTENSIONS under OYE_INLINE_MAX_BYTES, also append '[Content of <name>]:\\n<text>' (mirrors slack.py:864-877 and discord.py:2366-2379 exactly). Skip on UnicodeDecodeError. Why a separate oye_documents cache instead of reusing document_cache: The upstream document_cache auto-mount in tools/credential_files.py:357 (get_cache_directory_mounts) computes host paths from inside the gateway container. For any non-CreatBot bot, this produces the wrong host path because the bot home is bind-mounted as /home/dev/.hermes inside the gateway via the compose trick (e.g. /home/dev/.hermes-sunshine: /home/dev/.hermes for Sunshine). The docker daemon then bind-mounts /home/dev/.hermes/document_cache from the host — which is CreatBot's parent, not Sunshine's. Image/audio paths have hidden the same bug because vision/transcription run inside the gateway and never use the sandbox mount; document handling is the first flow that exercises the mount end-to-end. Mold-38 sidesteps the bug by using a fully separate, explicitly-mounted cache wired via each bot's terminal.docker_volumes: CreatBot: /home/dev/.hermes/oye_documents:/home/pn/.hermes/cache/oye-documents:rw Sunshine: /home/dev/.hermes-sunshine/oye_documents:/home/pn/.hermes/cache/oye-documents:rw The destination /home/pn/.hermes/cache/oye-documents deliberately differs from the auto-injected /root/.hermes/cache/documents (which is both broken AND unreadable to the sandbox's pn user, since /root is mode 700). The auto-mount is NOT touched by this patch. Follow-up fleet mold (NOT in mold-38) should: - Introduce HERMES_HOST_HOME env var per bot in each compose file. - Patch get_cache_directory_mounts to substitute HERMES_HOME -> HERMES_HOST_HOME when computing host paths. - Migrate Oye from oye_documents back onto the shared cache/documents and collapse _cache_oye_document into the upstream helper. Upstream status: nothing equivalent in api_server.py on origin/main. The OpenAI \`type: file\` content shape is supported by the upstream Chat Completions API spec but no upstream gateway processes it. Worth opening a small PR to upstream the type-set + branch (without the oye_documents sidestep — that part is fleet-specific). ================================================================ Re-applying after a hermes upgrade ================================================================ When \`hermes update\` (or a manual git pull) brings in new upstream commits, this patch needs to be re-applied. Recommended procedure: 1. Save the current monkey-patched file as a reference: cp gateway/platforms/api_server.py /tmp/api_server.MONKEYPATCHED 2. Update main: git checkout main git pull --ff-only origin main # or reset --hard if diverged 3. Try cherry-pick first (will likely conflict on the file above): git cherry-pick <previous-monkey-patch-sha> 4. For each conflict region, the rule is: - Take upstream's NEW additions (session_db, fallback_model, session_id parameters added since the last patch). - Keep our additions (reasoning_callback, _progress_q, _reasoning_q, _drain_side_queues, _process_multimodal_content, MAX_REQUEST_BYTES bump, OYE_DOCUMENT_CACHE_DIR + helpers, the file branch). - Replace upstream's \`_on_tool_progress(name, preview, args)\` (the inline-markdown one from PR NousResearch#4092) with our queue-based version that matches the AIAgent 4-arg signature above. 5. Verify all features after rebuild: a. Hermes syntax check: python3 -c \"import ast; ast.parse(open( 'gateway/platforms/api_server.py').read())\" b. Reinstall venv deps: uv pip install -e \".[all]\" c. Clear bytecode: find . -type d -name __pycache__ -exec rm -rf {} + d. Restart bot with the 75s telegram-polling restart gap (see deploy-hermes skill — \`down\`, sleep 75s, \`up -d\`). e. Test reasoning + tool_progress + file attachments end-to-end via Oye web upload. 6. If cherry-pick is too conflict-prone (>5 hunks), fall back to: diff /tmp/api_server.MONKEYPATCHED gateway/platforms/api_server.py and re-apply additions manually using the feature descriptions in this commit message as your contract. ================================================================ Files touched ================================================================ gateway/platforms/api_server.py # all of the above Nothing else. The patch deliberately stays in one file so the bridge layer stays self-contained and easy to spot in \`git log\`. ================================================================ Related upstream PRs ================================================================ NousResearch#4046 — multimodal content support (still OPEN) NousResearch#4265 — tool_progress + reasoning SSE wiring (still OPEN) When/if either merges, drop the corresponding feature from this commit. File attachment handling (Feature 4) has no upstream PR yet.

…/responses OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes #5621, #8253, #4046, #6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>

…/responses (#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes #5621, #8253, #4046, #6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>

teknium1 · 2026-04-20T11:16:41Z

Closed in favor of #12969 (merged as f683132). Multimodal image inputs are now supported on both /v1/chat/completions and /v1/responses. Audio parts remain out of scope for a future PR. Credited in the merged PR body.

manuelschipper · 2026-04-20T21:05:51Z

Thanks @teknium1 for closing this out via #12969 and for calling out audio as future work. I split the remaining audio piece into a narrower follow-up PR here: #13184.

That PR only adds input_audio support on the final user message for /v1/chat/completions, reuses Hermes’ existing STT path, keeps /v1/responses out of scope, and leaves the merged image-input behavior untouched. Focused tests and docs are included.

…/responses (NousResearch#12969) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text|image_url, image_url: {url, detail?}} Responses: {type: input_text|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes NousResearch#5621, NousResearch#8253, NousResearch#4046, NousResearch#6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>

manuelschipper force-pushed the feat/api-server-multimodal branch from d6bf667 to 39b41f6 Compare March 31, 2026 06:31

sunxyless mentioned this pull request Apr 19, 2026

feat(api-server): support multimodal image uploads via auxiliary vision #12329

Closed

19 tasks

teknium1 mentioned this pull request Apr 20, 2026

feat(api-server): inline image inputs on /v1/chat/completions and /v1/responses #12969

Merged

teknium1 closed this Apr 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(api_server): multimodal content support (images + audio)#4046

feat(api_server): multimodal content support (images + audio)#4046
manuelschipper wants to merge 1 commit into
NousResearch:mainfrom
manuelschipper:feat/api-server-multimodal

manuelschipper commented Mar 30, 2026

Uh oh!

teknium1 commented Apr 20, 2026

Uh oh!

manuelschipper commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

manuelschipper commented Mar 30, 2026

Summary

Test plan

Uh oh!

teknium1 commented Apr 20, 2026

Uh oh!

manuelschipper commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants