feat: add OpenAI audio APIs (TTS / STT) with SSE streaming #1784

Merged
looplj merged 2 commits into looplj:unstable from EmccK:feat/openai-audio on Jun 4, 2026
Conversation

EmccK commented Jun 4, 2026

Contributor

Closes #1743

Summary

This PR adds OpenAI-compatible audio API endpoints to AxonHub:

  • POST /v1/audio/speech — text-to-speech (TTS), binary audio response
  • POST /v1/audio/transcriptions — speech-to-text (STT), multipart upload
  • POST /v1/audio/translations — speech-to-text translation, multipart upload

It also supports SSE streaming for both directions:

  • TTS stream_format: "sse" (gpt-4o-mini-tts)
  • STT stream: true (gpt-4o-transcribe)
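
As a rough illustration of the request shape for the streaming TTS case, the Go sketch below builds a `POST /v1/audio/speech` request with `stream_format: "sse"`. The base URL, API key, voice value, and the names `speechPayload` and `newSpeechRequest` are assumptions made for the example; only the path and the `model`, `input`, `voice`, and `stream_format` fields come from this description.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// speechPayload mirrors the JSON fields named in this PR description.
// The struct name itself is hypothetical.
type speechPayload struct {
	Model        string `json:"model"`
	Input        string `json:"input"`
	Voice        string `json:"voice"`
	StreamFormat string `json:"stream_format,omitempty"`
}

// newSpeechRequest builds (but does not send) a POST /v1/audio/speech request.
func newSpeechRequest(baseURL, apiKey string, p speechPayload) (*http.Request, error) {
	body, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/speech", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}

func main() {
	// Placeholder base URL and key; substitute the values of a real deployment.
	req, err := newSpeechRequest("http://localhost:8090", "example-key", speechPayload{
		Model:        "gpt-4o-mini-tts",
		Input:        "Hello from the audio endpoint.",
		Voice:        "alloy",
		StreamFormat: "sse",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Leaving `StreamFormat` empty omits the field, which corresponds to the non-streaming binary response path.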

The binary chunked TTS stream (the OpenAI default without stream_format) is intentionally out of scope — it would require a third response carrier (raw byte stream) on top of the existing JSON / SSE pipeline, and most clients already accept SSE streaming.

Changes

Backend

Unified models

  • llm/audio.go (new): SpeechRequest, TranscriptionRequest, TranslationRequest, plus their non-streaming responses and SpeechStreamEvent / TranscriptionStreamEvent for SSE.
  • llm/model.go: Request and Response carry the audio sub-structs and stream events alongside existing Embedding / Image / Video fields.
  • llm/constants.go: three new request types and three new API formats (openai/audio_speech, openai/audio_transcriptions, openai/audio_translations).

OpenAI transformer

  • llm/transformer/openai/audio_inbound.go (new): parses JSON (speech) and multipart (STT) requests with fail-fast validation (model, input, voice, file required; invalid temperature / stream rejected; duplicate file parts rejected; client-controlled filenames sanitized). Implements TransformResponse / TransformStream / AggregateStreamChunks.
  • llm/transformer/openai/audio_outbound.go (new): builds JSON / multipart provider requests, sets the right Accept header (*/* for binary TTS, text/event-stream for SSE), parses provider responses and SSE events.
  • llm/transformer/openai/outbound.go: TransformRequest / TransformResponse / TransformStream / AggregateStreamChunks dispatch the new audio formats to the dedicated audio code.
  • Stream-level errors (event: error or {"error":{...}}) flowing in after a 200 response are surfaced as stream errors via parseStreamErrorEvent, instead of being silently decoded as empty audio events.
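
The stream-level error handling in the last bullet can be sketched roughly as follows. This is not the PR's `parseStreamErrorEvent`, whose signature is not shown here; `sseEvent`, `streamError`, and `detectStreamError` are hypothetical names, and the `message` field inside the error object is an assumption about the provider's envelope.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sseEvent is a hypothetical, minimal view of one decoded SSE event.
type sseEvent struct {
	Type string // value of the "event:" field, may be empty
	Data []byte // value of the "data:" field
}

// streamError is the error surfaced to the caller instead of an empty audio event.
type streamError struct {
	Message string
}

func (e *streamError) Error() string { return "stream error: " + e.Message }

// detectStreamError returns a non-nil error when the event is either a JSON
// payload with a top-level "error" object or an explicit "event: error".
func detectStreamError(ev sseEvent) error {
	var payload struct {
		Error *struct {
			Message string `json:"message"`
		} `json:"error"`
	}
	// Only a cleanly decoded payload counts as an error envelope.
	if err := json.Unmarshal(ev.Data, &payload); err == nil && payload.Error != nil {
		return &streamError{Message: payload.Error.Message}
	}
	if ev.Type == "error" {
		return &streamError{Message: string(ev.Data)}
	}
	return nil
}

func main() {
	ev := sseEvent{Data: []byte(`{"error":{"message":"quota exceeded"}}`)}
	fmt.Println(detectStreamError(ev))
}
```

Checking for the error envelope before decoding the event as audio is what keeps a failure that arrives after the 200 status from being treated as an empty but valid chunk.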

Pipeline

  • llm/pipeline/empty_response.go: hasResponseContent recognises Speech.Audio, Transcription.Text/Raw, and the two new stream-event carriers so audio responses do not trigger empty-response retries.
  • llm/httpclient/client.go: Do only sets the default Accept: application/json when the transformer did not already set one (TTS needs */*).
  • llm/httpclient/utils.go: Accept is added to blockedHeaders so inbound client values cannot override the transformer-owned Accept.
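
The Accept-header ownership described in the two httpclient bullets can be sketched as below. This is an illustration under assumptions, not the code in `client.go` or `utils.go`: the function names are hypothetical, and the real `blockedHeaders` list presumably contains more entries than `Accept`.

```go
package main

import (
	"fmt"
	"net/http"
)

// blockedHeaders lists inbound client headers that are never forwarded.
// Only "Accept" is taken from the PR text.
var blockedHeaders = map[string]bool{
	"Accept": true,
}

// mergeInboundHeaders copies client headers onto the outbound request,
// skipping blocked ones so they cannot override transformer-owned values.
func mergeInboundHeaders(outbound, inbound http.Header) {
	for name, values := range inbound {
		if blockedHeaders[http.CanonicalHeaderKey(name)] {
			continue
		}
		for _, v := range values {
			outbound.Add(name, v)
		}
	}
}

// applyDefaultAccept sets Accept: application/json only if nothing set it yet.
func applyDefaultAccept(h http.Header) {
	if h.Get("Accept") == "" {
		h.Set("Accept", "application/json")
	}
}

// resolveAccept returns the Accept value sent upstream, given the value set
// by the transformer (may be empty) and the value supplied by the client.
func resolveAccept(transformerAccept, clientAccept string) string {
	outbound := http.Header{}
	if transformerAccept != "" {
		outbound.Set("Accept", transformerAccept)
	}
	inbound := http.Header{}
	if clientAccept != "" {
		inbound.Set("Accept", clientAccept)
	}
	mergeInboundHeaders(outbound, inbound)
	applyDefaultAccept(outbound)
	return outbound.Get("Accept")
}

func main() {
	fmt.Println(resolveAccept("*/*", "text/html")) // transformer value wins
	fmt.Println(resolveAccept("", "text/html"))    // falls back to the JSON default
}
```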

Delegating channels (NanoGPT / OpenRouter)

  • llm/transformer/nanogpt/outbound.go and llm/transformer/openrouter/outbound.go: TransformResponse / TransformStream / AggregateStreamChunks delegate audio requests to the embedded OpenAI outbound so transcript / speech events are not parsed as chat completions.

Orchestrator & persistence

  • internal/server/api/openai.go and routes.go: three new handlers wired into the existing ChatCompletion orchestrator (reusing pipeline, quota, persistence, retries, etc.).
  • internal/server/orchestrator/select_endpoints.go: per-request-type allow-lists added for speech / transcription / translation.
  • internal/server/orchestrator/request.go: persistence is made JSON-column-safe. TTS binary audio is replaced with a compact metadata placeholder and offloaded to external DataStorage (mirroring how video artifacts are stored), and non-JSON STT bodies (text / srt / vtt) are wrapped in a JSON object. The body type is determined from Content-Type first; content sniffing is used only when the header is absent.
  • internal/server/orchestrator/request_execution.go: execution-record format uses the actual outbound request APIFormat instead of the wrapper's primary format, so audio executions are tracked correctly (chat-only APIFormat() would otherwise mis-tag them).
  • internal/server/orchestrator/pass_through.go: multipart audio formats are excluded from body pass-through because the outbound transformer rebuilds the multipart payload with a new boundary; speech keeps pass-through with model-patch support.
  • internal/server/orchestrator/inbound.go: isTerminalStreamEvent recognises speech.audio.done and transcript.text.done so audio SSE streams that omit [DONE] still mark the request as completed.
  • internal/server/biz/request.go: new UpdateRequestCompletedWithAudio and GenerateAudioKey for offloading TTS audio to external storage.
  • internal/server/biz/channel_endpoint.go and channel_llm.go: audio endpoints are exposed as configurable, plus default endpoints for OpenAI, NanoGPT, and OpenRouter (other OpenAI-compatible channels can opt in via custom endpoints — they are not auto-enabled to avoid mis-routing to providers that may not actually support the audio APIs).
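
The JSON-column-safe persistence described for `request.go` above can be sketched roughly like this. It is not the PR's `audioSafeResponseBody`: the function names and the placeholder field names (`type`, `content_type`, `size_bytes`) are assumptions, and this sketch also sniffs when the Content-Type is unrecognised, not only when it is absent.

```go
package main

import (
	"encoding/json"
	"fmt"
	"mime"
	"strings"
	"unicode/utf8"
)

// audioPlaceholder returns compact JSON metadata stored in place of binary audio.
func audioPlaceholder(mediaType string, size int) []byte {
	out, _ := json.Marshal(map[string]interface{}{
		"type":         "audio_placeholder",
		"content_type": mediaType,
		"size_bytes":   size,
	})
	return out
}

// wrapText wraps a plain-text body (text / srt / vtt) in a JSON object.
func wrapText(body []byte) []byte {
	out, _ := json.Marshal(map[string]string{"text": string(body)})
	return out
}

// jsonSafeBody returns a body that is safe to store in a JSON column.
func jsonSafeBody(contentType string, body []byte) []byte {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		mediaType = "" // absent or malformed header: fall back to sniffing
	}

	// Content-Type decides first.
	switch {
	case mediaType == "application/json":
		return body
	case strings.HasPrefix(mediaType, "audio/"), mediaType == "application/octet-stream":
		return audioPlaceholder(mediaType, len(body))
	case strings.HasPrefix(mediaType, "text/"):
		return wrapText(body)
	}

	// Sniffing path: valid JSON is kept, valid UTF-8 is wrapped as text,
	// and anything else is treated as binary audio.
	switch {
	case json.Valid(body):
		return body
	case utf8.Valid(body):
		return wrapText(body)
	default:
		return audioPlaceholder("application/octet-stream", len(body))
	}
}

func main() {
	fmt.Println(string(jsonSafeBody("text/vtt", []byte("WEBVTT"))))
	fmt.Println(string(jsonSafeBody("audio/mpeg", []byte{0xff, 0xfb, 0x90})))
}
```

In this sketch the binary bytes themselves are simply not written to the column; per the description above, the PR offloads them to external DataStorage.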

Frontend

  • frontend/src/features/channels/data/schema.ts: register the new API formats.
  • frontend/src/features/models/components/models-association-dialog.tsx: audio formats appear in the association filter options.
  • frontend/src/features/models/data/providers.json: gpt-4o-mini-tts added.
  • frontend/src/features/requests/components/request-detail-content.tsx: inline audio playback for completed TTS requests, with a download fallback and a dedicated load-failure state instead of silently showing "no response data".
  • frontend/src/features/requests/utils/curl-generator.ts: cURL preview emits -F flags for STT (multipart) instead of JSON -d, and drops the logged Content-Type (carries a stale boundary).
  • i18n strings added for both en and zh-CN.

Tests

  • llm/transformer/openai/audio_inbound_test.go, audio_outbound_test.go (new): cover JSON / multipart parsing, fail-fast validation (voice, temperature, duplicate file, unsupported stream_format), verbose_json raw passthrough, content-type-aware response wrapping, multipart filename header-injection defence, end-to-end round trip (non-stream + SSE TTS + SSE STT), and stream-level error propagation.
  • llm/pipeline/empty_response_test.go: new cases for audio response and stream-event content detection.
  • llm/httpclient/utils_test.go: Accept header is no longer overridden by the inbound request.
  • internal/server/orchestrator/request_test.go: audioSafeResponseBody matrix and end-to-end TTS persistence test (binary stored externally, JSON column holds metadata).
  • internal/server/orchestrator/pass_through_test.go: multipart audio is excluded from pass-through; speech pass-through still patches the model.
  • internal/server/orchestrator/inbound_test.go: terminal-event detection covers speech.audio.done and transcript.text.done.
  • internal/server/biz/request_audio_test.go (new): TTS audio offloaded to external DataStorage, request row references it via content_storage_*.
  • internal/server/biz/channel_endpoint_mapping_test.go, channel_llm_*_test.go: default endpoint maps updated to reflect the restricted audio rollout.
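
To make the filename header-injection defence covered by the transformer tests above more concrete, here is a minimal sketch of sanitizing a client-controlled filename before it is written into a rebuilt multipart part. The function name `sanitizeFilename`, the exact character set that is stripped, and the `"audio"` fallback are assumptions, not the rules implemented in `audio_inbound.go`.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sanitizeFilename reduces a client-supplied filename to a single safe path
// element that cannot break out of a quoted Content-Disposition parameter.
func sanitizeFilename(name string) string {
	// Normalise Windows separators, then keep only the final path element.
	name = strings.ReplaceAll(name, "\\", "/")
	name = path.Base(name)

	var b strings.Builder
	for _, r := range name {
		// Drop control characters (including CR/LF) and characters that
		// could terminate the quoted filename parameter.
		if r < 0x20 || r == 0x7f || r == '"' || r == ';' {
			continue
		}
		b.WriteRune(r)
	}

	clean := strings.TrimSpace(b.String())
	if clean == "" || clean == "." || clean == ".." || clean == "/" {
		return "audio" // hypothetical fallback name
	}
	return clean
}

func main() {
	fmt.Println(sanitizeFilename("../../etc/passwd"))
	fmt.Println(sanitizeFilename("a.mp3\"\r\nX-Injected: 1"))
}
```

With CR and LF removed, the injected text can no longer start a new header line; it survives only as harmless characters inside the filename.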

Compatibility

  • Existing chat / completions / embeddings / image / video routes are unchanged.
  • Other OpenAI-compatible channels (Vercel, DeepInfra, PPIO, SiliconFlow, AtlasCloud, Aihubmix, Burncloud, GitHub Models) do not receive audio default endpoints automatically; they keep the previous six-endpoint default and can opt in via custom endpoints.
  • Accept header semantics change for HTTP requests where the outbound transformer set it explicitly — the previous behaviour unconditionally overwrote it with application/json. All in-tree transformers already set their intended Accept explicitly, so the user-visible behaviour is unchanged outside of audio.

Out of scope

  • TTS default binary chunked stream (stream_format not set): requires a new raw-byte response carrier in the pipeline and the WriteSSEStream writer; tracked as a follow-up. Clients can use stream_format: "sse" for streaming TTS today, or fall back to the non-streaming path.

greptile-apps Bot commented Jun 4, 2026

Contributor

Greptile Summary

This PR wires three new OpenAI-compatible audio endpoints (/v1/audio/speech, /v1/audio/transcriptions, /v1/audio/translations) into the existing pipeline, reusing quota, retry, and persistence infrastructure. SSE streaming is supported for both TTS (stream_format: "sse") and STT (stream: true), while raw binary chunked TTS streaming is explicitly deferred.

  • Backend: new inbound/outbound transformers with fail-fast validation, multipart parsing, content-type-aware response wrapping, and a dedicated persistence path that offloads binary TTS audio to external storage (mirroring video artifacts) and JSON-wraps non-JSON STT bodies before DB persistence.
  • Pipeline changes: Accept header ownership is moved to the outbound transformer, audio formats are excluded from body pass-through, and isTerminalStreamEvent is extended to recognise speech.audio.done and transcript.text.done.
  • Frontend: inline <audio> player with object-URL lifecycle management, a download fallback, and cURL preview updated to emit -F flags for multipart STT requests.
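
The terminal-event detection mentioned in the pipeline bullet above can be illustrated with a short sketch. It is not the PR's `isTerminalStreamEvent`; the name `isTerminalEvent` is hypothetical, and the sketch assumes that each SSE `data:` payload is a JSON object carrying the event name in a `type` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isTerminalEvent reports whether an SSE data payload marks the end of a stream.
func isTerminalEvent(data string) bool {
	data = strings.TrimSpace(data)
	if data == "[DONE]" {
		return true
	}

	var ev struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal([]byte(data), &ev); err != nil {
		return false // not JSON, so not a typed audio event
	}

	switch ev.Type {
	case "speech.audio.done", "transcript.text.done":
		return true
	}
	return false
}

func main() {
	fmt.Println(isTerminalEvent(`{"type":"transcript.text.delta","delta":"hi"}`))
	fmt.Println(isTerminalEvent(`{"type":"transcript.text.done","text":"hi"}`))
}
```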

Confidence Score: 5/5

Safe to merge; the new audio endpoints are additive and existing chat/embedding/image/video paths are not modified in behaviour.

The implementation is thorough: binary TTS bodies are intercepted before they can corrupt the JSON column, non-JSON STT responses are wrapped, Accept-header ownership is correctly moved to the outbound transformer, and multipart formats are excluded from body pass-through. The only finding is a missing log warning when TTS audio is dropped because no external storage is configured.

internal/server/biz/request.go (UpdateRequestCompletedWithAudio silent audio drop path); llm/transformer/openai/audio_inbound.go (text-field size limit flagged in prior thread).

Important Files Changed

| Filename | Overview |
| --- | --- |
| llm/transformer/openai/audio_inbound.go | New inbound transformer for TTS/STT with fail-fast validation, multipart parsing, SSE aggregation, and filename sanitization; text fields use the audio file size limit (26 MB), which was flagged in a previous thread. |
| llm/transformer/openai/audio_outbound.go | New outbound transformer builds JSON (TTS) and multipart (STT) provider requests, dispatches SSE stream decoding to dedicated parsers, and correctly sets Accept headers for binary vs event-stream responses. |
| internal/server/biz/request.go | Adds UpdateRequestCompletedWithAudio and GenerateAudioKey; correctly uses xjson.Marshal (which passes []byte through as raw JSON), but silently drops TTS audio bytes without a warning log when no non-primary external storage is configured. |
| internal/server/orchestrator/request.go | Adds audioSafeResponseBody and audioFilenameForContentType; correctly routes TTS to UpdateRequestCompletedWithAudio and wraps non-JSON STT bodies before DB persistence. |
| internal/server/orchestrator/request_execution.go | Execution records now use the actual outbound APIFormat instead of the wrapper primary and correctly sanitize binary TTS bodies via audioSafeResponseBody before DB persistence. |
| internal/server/orchestrator/pass_through.go | Correctly excludes multipart audio formats from body pass-through and adds Speech (JSON body) to the model-patch allow-list. |
| internal/server/orchestrator/inbound.go | Extends isTerminalStreamEvent to recognise speech.audio.done and transcript.text.done so audio SSE streams without [DONE] are marked completed. |
| frontend/src/features/requests/components/request-detail-content.tsx | Adds inline audio playback with object-URL lifecycle management and a download fallback; fetchStoredContent dependency array and shared isLoadingAudio state were flagged in prior threads. |
| llm/httpclient/client.go | Conditionally sets Accept: application/json only when not already set by the transformer, allowing TTS to use */* for binary audio responses. |
| llm/httpclient/utils.go | Adds Accept to blockedHeaders so inbound client values cannot override the transformer-set Accept header. |
| internal/server/biz/channel_endpoint.go | Introduces openAIFullDefaultEndpoints with audio formats; correctly restricts audio defaults to confirmed-compatible channel types (OpenAI, NanoGPT, OpenRouter) and leaves other compatible channels opt-in only. |
| internal/server/orchestrator/select_endpoints.go | Adds per-request-type allow-lists for speech, transcription, and translation, preventing audio requests from being routed to chat-only endpoints. |
| llm/audio.go | Clean unified model definitions for TTS/STT request/response/stream-event types; binary fields tagged json:"-" prevent accidental JSON serialization. |
| llm/transformer/openai/outbound.go | Correctly dispatches TransformRequest/TransformResponse/TransformStream/AggregateStreamChunks to audio handlers for the three new formats; non-audio paths unchanged. |

Reviews (2): Last reviewed commit: "feat: support OpenAI audio SSE streaming (TTS stream_for..."

Comment thread llm/transformer/openai/audio_inbound.go
Comment thread llm/transformer/openai/audio_inbound.go
EmccK force-pushed the feat/openai-audio branch from 42e0974 to 36fef6a on June 4, 2026 14:09
EmccK added 2 commits June 4, 2026 22:10
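Add three OpenAI-compatible audio endpoints:
- POST /v1/audio/speech: text-to-speech (TTS), binary audio response
- POST /v1/audio/transcriptions: speech transcription (STT), multipart upload
- POST /v1/audio/translations: speech translation (STT), multipart upload

Backend:
- llm/audio.go: unified audio request/response models; the Extra field passes
  through parameters that are not modelled
- openai audio inbound/outbound transformers: multipart parsing and rebuilding,
  filename sanitization against header injection, lossless passthrough of the
  raw verbose_json fields
- Empty-response detection recognises Speech/Transcription content
- TTS binary audio is persisted to external storage (content_storage_* fields);
  the DB stores only a metadata placeholder; non-JSON STT responses are wrapped
  as JSON before being stored
- The execution-record format prefers the APIFormat of the actual outbound request
- The Accept header is owned by the outbound transformer; inbound requests
  cannot override it
- Body pass-through is disabled for multipart audio (boundary mismatch)
- Default audio endpoints are granted only to OpenAI/NanoGPT/OpenRouter;
  other compatible channels enable them explicitly via custom endpoints

Frontend:
- Audio preview/download on the request detail page
- STT cURL generation uses the -F multipart format
- Channel endpoints and the model association dialog support audio formats
- i18n (en/zh-CN)

Add SSE streaming support; the binary chunked stream form is out of scope for this change.

- llm/audio.go: add the SpeechStreamEvent / TranscriptionStreamEvent models and
  a StreamFormat field on the Speech request
- llm/model.go: add stream-event carrier fields to Response
- llm/pipeline: empty-response detection recognises stream-event content
- openai outbound: TransformStream/AggregateStreamChunks dispatch by APIFormat
  to audio-specific SSE decoding and aggregation
- openai inbound: parse stream/stream_format, reject any stream_format other
  than sse, and implement TransformStream/AggregateStreamChunks;
  the STT stream field is consumed by the gateway and is no longer passed
  through in Extra, which would send it twice
- multipart requests append stream=true and Accept: text/event-stream as needed
- nanogpt/openrouter: audio requests are delegated to the underlying OpenAI
  TransformStream and AggregateStreamChunks so they are not parsed as chat completions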
EmccK force-pushed the feat/openai-audio branch from 36fef6a to 5243a1f on June 4, 2026 14:10
EmccK commented Jun 4, 2026

Contributor Author

Hi @looplj, fixed the lint failures and addressed the Greptile review feedback (filename fallback in fetchStoredContent). Could you please re-approve the workflow runs when you have a moment? Thanks!

looplj merged commit ad096f5 into looplj:unstable on Jun 4, 2026
4 checks passed
junjiangao pushed a commit to junjiangao/axonhub that referenced this pull request Jun 5, 2026
* feat: support OpenAI audio APIs (TTS/STT)
* feat: support OpenAI audio SSE streaming (TTS stream_format=sse / STT stream=true)
EmccK deleted the feat/openai-audio branch on June 6, 2026 08:45
Development

Successfully merging this pull request may close these issues.

[Feature]: Please add support for TTS and STT models

2 participants