feat(api_server): accept input_audio chat parts #13184

Open
manuelschipper wants to merge 1 commit into NousResearch:main from manuelschipper:feat/api-server-input-audio

@manuelschipper

Contributor

Summary

  • Accept OpenAI-compatible input_audio content parts on the final user message for /v1/chat/completions.
  • Validate base64 payloads and supported Hermes STT audio formats, then transcribe through the existing STT pipeline before agent entry.
  • Preserve mixed text/image/audio requests by replacing audio parts with transcript text while leaving text and image parts intact.
  • Keep /v1/responses, assistant messages, and non-final user-message audio explicitly rejected with OpenAI-shaped errors.
  • Document the API Server audio-input contract and supported formats.
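
The behavior described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `replace_audio_parts`, `AudioPartError`, `SUPPORTED_FORMATS`, and the `transcribe` callback are hypothetical names, the format set is a placeholder for whatever Hermes STT actually supports, and "final user message" is assumed to mean the last message whose role is `user`. The `input_audio` part shape and the error body follow the OpenAI Chat Completions conventions.

```python
import base64
import binascii

# Assumption: placeholder set; the real list of supported formats lives in Hermes' STT code.
SUPPORTED_FORMATS = {"wav", "mp3"}


class AudioPartError(ValueError):
    """Hypothetical exception carrying an OpenAI-shaped error body."""

    def __init__(self, message, param):
        super().__init__(message)
        self.body = {
            "error": {
                "message": message,
                "type": "invalid_request_error",
                "param": param,
                "code": None,
            }
        }


def replace_audio_parts(messages, transcribe):
    """Return a copy of `messages` where input_audio parts on the final user
    message are replaced by text parts holding the transcript.

    `transcribe(audio_bytes, fmt) -> str` stands in for the STT pipeline.
    Text and image parts are passed through unchanged.
    """
    user_indexes = [i for i, m in enumerate(messages) if m.get("role") == "user"]
    final_user = user_indexes[-1] if user_indexes else -1
    out = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if not isinstance(content, list):
            out.append(msg)  # plain-string content has no parts to rewrite
            continue
        parts = []
        for j, part in enumerate(content):
            if not isinstance(part, dict) or part.get("type") != "input_audio":
                parts.append(part)
                continue
            param = f"messages[{i}].content[{j}]"
            if i != final_user:
                raise AudioPartError(
                    "input_audio is only accepted on the final user message", param
                )
            audio = part.get("input_audio")
            audio = audio if isinstance(audio, dict) else {}
            fmt = audio.get("format")
            if not isinstance(fmt, str) or fmt not in SUPPORTED_FORMATS:
                raise AudioPartError(f"unsupported audio format: {fmt!r}", param)
            data = audio.get("data")
            if not isinstance(data, str) or not data:
                raise AudioPartError("input_audio.data must be a non-empty string", param)
            try:
                # validate=True rejects characters outside the base64 alphabet
                raw = base64.b64decode(data, validate=True)
            except (binascii.Error, ValueError) as exc:
                raise AudioPartError("input_audio.data is not valid base64", param) from exc
            parts.append({"type": "text", "text": transcribe(raw, fmt)})
        out.append({**msg, "content": parts})
    return out
```

In this sketch a server handler would catch `AudioPartError` and return `exc.body` with an HTTP 400 status, and would run the rewrite before handing the messages to the agent.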

Why

The API Server already accepts inline image content parts. This adds the audio side of the same OpenAI-compatible content-part surface without introducing a new provider or credential contract, because it reuses Hermes’ existing voice/STT implementation.

Verification

  • scripts/run_tests.sh tests/gateway/test_api_server_multimodal.py tests/gateway/test_api_server.py
  • Result: 155 passed, 91 warnings

Notes

  • Inline audio is still subject to the API Server request body limit, so audio too large to fit within that limit cannot be sent through this payload path.
  • The PR is a draft while CI and maintainer review settle any contract preferences.
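
On the request body limit noted above: base64 encodes every 3 raw bytes as 4 characters, so inline audio takes about a third more space in the request than the raw file. The sketch below estimates how much raw audio fits under a given limit. The helper names and the `overhead` parameter (the bytes taken by the rest of the JSON body) are hypothetical, and the PR does not state the actual limit, so any value passed in is an assumption.

```python
def base64_encoded_len(raw_len: int) -> int:
    """Length in characters of padded base64 for `raw_len` raw bytes."""
    return 4 * ((raw_len + 2) // 3)


def max_raw_audio_bytes(body_limit: int, overhead: int = 0) -> int:
    """Largest raw audio size whose padded base64 fits in `body_limit - overhead`.

    `body_limit` is whatever the server enforces; it is not stated in the PR.
    """
    budget = max(body_limit - overhead, 0)
    return (budget // 4) * 3
```

For example, under a hypothetical limit of 1,000,000 bytes with no other body content, `max_raw_audio_bytes(1_000_000)` gives 750,000 raw bytes.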

@manuelschipper marked this pull request as ready for review on April 20, 2026, 20:57
@trevorgordon981

✅ Review Complete - LGTM

Verified - input_audio chat parts implemented correctly

Feature Analysis

input_audio support

  • Accepts input_audio chat parts in the API Server
  • Properly validates audio data and format
  • Integrates with existing multimodal pipeline
  • Supports base64-encoded audio data
  • Format validation (wav, mp3, etc.)

Implementation

  • Clean integration in api_server.py
  • Proper normalization of audio parts
  • Validation before processing
  • Error handling for invalid audio
  • No breaking changes to existing functionality

Impact

  • Enables multimodal chat with audio input
  • Critical for voice-to-text workflows
  • Follows OpenAI API patterns
  • Ready for production use

Recommendation

Merge immediately. This feature:

  1. Adds native audio input support (highly requested)
  2. Has a clean implementation with proper validation
  3. Introduces no breaking changes
  4. Is ready for user testing

Suggest adding e2e tests for audio workflows in a follow-up PR, but the core implementation is solid.


Tested on: macOS (Apple Silicon), Python 3.11.15
Date: April 20, 2026

@alt-glitch added labels on Apr 22, 2026: comp/gateway (Gateway runner, session dispatch, delivery), P3 (Low: cosmetic, nice to have), tool/tts (Text-to-speech and transcription), type/feature (New feature or request)

3 participants