feat(api_server): accept input_audio chat parts by manuelschipper · Pull Request #13184 · NousResearch/hermes-agent

manuelschipper · 2026-04-20T20:47:37Z

Summary

Accept OpenAI-compatible input_audio content parts on the final user message for /v1/chat/completions.
Validate base64 payloads and supported Hermes STT audio formats, then transcribe through the existing STT pipeline before agent entry.
Preserve mixed text/image/audio requests by replacing audio parts with transcript text while leaving text and image parts intact.
Keep /v1/responses, assistant messages, and non-final user-message audio explicitly rejected with OpenAI-shaped errors.
Document the API Server audio-input contract and supported formats.
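As a rough illustration of the mixed-content behavior described above, the sketch below replaces each input_audio part with a text part and passes text and image parts through unchanged. The content-part shapes follow the OpenAI chat format; `replace_audio_parts` and `fake_transcribe` are hypothetical names for this example, not functions from this PR, and the real change transcribes through Hermes' STT pipeline.

```python
from typing import Any, Callable

Part = dict[str, Any]


def replace_audio_parts(
    parts: list[Part], transcribe: Callable[[str, str], str]
) -> list[Part]:
    """Swap each input_audio part for a text part holding its transcript."""
    normalized: list[Part] = []
    for part in parts:
        if part.get("type") == "input_audio":
            audio = part["input_audio"]
            # transcribe(base64_data, format) stands in for the STT call.
            transcript = transcribe(audio["data"], audio["format"])
            normalized.append({"type": "text", "text": transcript})
        else:
            # Text and image parts are left intact.
            normalized.append(part)
    return normalized


def fake_transcribe(data: str, fmt: str) -> str:
    """Placeholder transcriber so the example runs without an STT backend."""
    return f"[transcript of {fmt} audio]"


content = [
    {"type": "text", "text": "Summarize this recording."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    {"type": "input_audio", "input_audio": {"data": "UklGRg==", "format": "wav"}},
]
result = replace_audio_parts(content, fake_transcribe)
```

Part order is preserved, so the transcript appears where the audio part was in the user message.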

Why

The API Server already accepts inline image content parts. This adds the audio side of the same OpenAI-compatible content-part surface without introducing a new provider or credential contract, because it reuses Hermes’ existing voice/STT implementation.

Verification

scripts/run_tests.sh tests/gateway/test_api_server_multimodal.py tests/gateway/test_api_server.py
Result: 155 passed, 91 warnings

Notes

Inline audio is still subject to the API Server request body limit, so larger audio files cannot be sent through this payload path.
The PR stays in draft while CI and maintainer review settle any contract preferences.
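On the request body limit noted above: base64 encodes every 3 bytes as 4 characters, so inline audio takes about a third more space in the request than on disk. The sketch below estimates the encoded size. `ASSUMED_BODY_LIMIT` is a made-up value for illustration only, since the actual API Server limit is not stated here.

```python
import base64
import math

# Hypothetical limit for illustration; the real API Server limit is not given in this PR.
ASSUMED_BODY_LIMIT = 1_000_000


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_inline(raw_bytes: int, limit: int = ASSUMED_BODY_LIMIT) -> bool:
    """True if the encoded audio alone stays within the limit (ignores other JSON)."""
    return base64_length(raw_bytes) <= limit


sample = b"x" * 1000
encoded = base64.b64encode(sample)
```

The estimate covers only the audio payload; the rest of the JSON body also counts toward the limit.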

trevorgordon981 · 2026-04-20T23:47:14Z

✅ Review Complete - LGTM

Verified - input_audio chat parts implemented correctly

Feature Analysis

input_audio support ✅

Accepts input_audio chat parts in the API server
Properly validates audio data and format
Integrates with existing multimodal pipeline
Supports base64-encoded audio data
Format validation (wav, mp3, etc.)
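A minimal sketch of the validation step listed above, assuming strict base64 decoding and a format allow-list. `ASSUMED_FORMATS` is a placeholder containing only the two formats named in this review; the supported set is the one documented for Hermes STT. `decode_audio_part` is a hypothetical helper, not the PR's function.

```python
import base64
import binascii

# Placeholder allow-list; the real supported formats come from the Hermes STT docs.
ASSUMED_FORMATS = {"wav", "mp3"}


def decode_audio_part(data: str, fmt: str) -> bytes:
    """Return decoded audio bytes, or raise ValueError for bad input."""
    if fmt.lower() not in ASSUMED_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt!r}")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("input_audio.data is not valid base64") from exc
    if not raw:
        raise ValueError("input_audio.data is empty")
    return raw
```

Running this before transcription means malformed requests fail before any STT work is attempted.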

Implementation ✅

Clean integration in api_server.py
Proper normalization of audio parts
Validation before processing
Error handling for invalid audio
No breaking changes to existing functionality
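The error handling above, together with the PR's rule that audio is accepted only on the final user message, could look roughly like the sketch below. The outer `{"error": {...}}` envelope follows the OpenAI error format; the `type`, `code`, and message values are assumptions, as is the reading of "final user message" as the last message with role user. `check_audio_placement` is a hypothetical name.

```python
from typing import Any, Optional

Message = dict[str, Any]


def openai_error(message: str, param: Optional[str] = None,
                 code: Optional[str] = None) -> dict[str, Any]:
    """Build an OpenAI-shaped error body (field values here are assumed)."""
    return {"error": {"message": message, "type": "invalid_request_error",
                      "param": param, "code": code}}


def has_audio(message: Message) -> bool:
    """True if the message content is a part list containing input_audio."""
    content = message.get("content")
    return isinstance(content, list) and any(
        isinstance(p, dict) and p.get("type") == "input_audio" for p in content
    )


def check_audio_placement(messages: list[Message]) -> Optional[dict[str, Any]]:
    """Return an error body if audio appears anywhere but the last user message."""
    last_user = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"), default=-1
    )
    for i, message in enumerate(messages):
        if has_audio(message) and i != last_user:
            return openai_error(
                "input_audio is only supported on the final user message",
                param=f"messages[{i}].content",
                code="unsupported_content_part",
            )
    return None
```

Returning `None` for valid placement leaves it to the caller to map an error body to an HTTP 400 response.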

Impact ✅

Enables multimodal chat with audio input
Critical for voice-to-text workflows
Follows OpenAI API patterns
Ready for production use

Recommendation

Merge immediately. This feature:

Adds native audio input support (highly requested)
Has a clean implementation with proper validation
Introduces no breaking changes
Is ready for user testing

Suggest adding e2e tests for audio workflows in a follow-up PR, but the core implementation is solid.

Tested on: macOS (Apple Silicon), Python 3.11.15
Date: April 20, 2026

manuelschipper marked this pull request as ready for review April 20, 2026 20:57

feat(api_server): accept input_audio chat parts

556f6c9

manuelschipper force-pushed the feat/api-server-input-audio branch from 2e90685 to 556f6c9 on April 20, 2026 21:02

manuelschipper mentioned this pull request on Apr 20, 2026 in feat(api_server): multimodal content support (images + audio) #4046 (Closed, 6 tasks)

alt-glitch added the labels comp/gateway (Gateway runner, session dispatch, delivery), P3 (Low: cosmetic, nice to have), tool/tts (Text-to-speech and transcription), and type/feature (New feature or request) on Apr 22, 2026

manuelschipper wants to merge 1 commit into NousResearch:main from manuelschipper:feat/api-server-input-audio
