feat: add OpenAI audio APIs (TTS / STT) with SSE streaming by EmccK · Pull Request #1784 · looplj/axonhub

EmccK · 2026-06-04T13:54:15Z

Summary

This PR adds OpenAI-compatible audio API endpoints to AxonHub:

POST /v1/audio/speech — text-to-speech (TTS), binary audio response
POST /v1/audio/transcriptions — speech-to-text (STT), multipart upload
POST /v1/audio/translations — speech-to-text translation, multipart upload

It also supports SSE streaming for both directions:

TTS stream_format: "sse" (gpt-4o-mini-tts)
STT stream: true (gpt-4o-transcribe)

The binary chunked TTS stream (the OpenAI default without stream_format) is intentionally out of scope — it would require a third response carrier (raw byte stream) on top of the existing JSON / SSE pipeline, and most clients already accept SSE streaming.
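As a rough illustration of the client side, the sketch below builds a streaming TTS request for the new endpoint. The base URL `http://localhost:8090`, the API key, the `alloy` voice, the Bearer authorization scheme, and the `speechPayload` type are assumptions made for the example; only the path, the `stream_format: "sse"` field, and the model name come from this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// speechPayload mirrors the JSON fields named in this PR; it is an
// illustrative type, not AxonHub's SpeechRequest.
type speechPayload struct {
	Model        string `json:"model"`
	Input        string `json:"input"`
	Voice        string `json:"voice"`
	StreamFormat string `json:"stream_format,omitempty"`
}

// speechBody encodes the payload, opting into SSE streaming when sse is true.
func speechBody(p speechPayload, sse bool) ([]byte, error) {
	if sse {
		p.StreamFormat = "sse"
	}
	return json.Marshal(p)
}

// newSpeechRequest builds a POST /v1/audio/speech request against baseURL.
func newSpeechRequest(baseURL, apiKey string, p speechPayload, sse bool) (*http.Request, error) {
	body, err := speechBody(p, sse)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/speech", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// Assumption: the gateway accepts OpenAI-style Bearer authentication.
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}

func main() {
	p := speechPayload{Model: "gpt-4o-mini-tts", Input: "Hello", Voice: "alloy"}
	req, err := newSpeechRequest("http://localhost:8090", "sk-example", p, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

Leaving `sse` false omits `stream_format` entirely, which corresponds to the non-streaming path with a binary audio response.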

Changes

Backend

Unified models

llm/audio.go (new): SpeechRequest, TranscriptionRequest, TranslationRequest, plus their non-streaming responses and SpeechStreamEvent / TranscriptionStreamEvent for SSE.
llm/model.go: Request and Response carry the audio sub-structs and stream events alongside existing Embedding / Image / Video fields.
llm/constants.go: three new request types and three new API formats (openai/audio_speech, openai/audio_transcriptions, openai/audio_translations).
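The review summary further down notes that binary fields in llm/audio.go are tagged `json:"-"`. A minimal sketch of what that tag buys, using a hypothetical `SpeechResponse` shape (the field names here are assumptions, not the actual struct):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SpeechResponse is a hypothetical shape, not the real llm/audio.go type.
type SpeechResponse struct {
	Audio       []byte `json:"-"`            // raw audio bytes, never serialized
	ContentType string `json:"content_type"` // for example "audio/mpeg"
}

// marshalSpeech returns the JSON form of a response as a string.
func marshalSpeech(r SpeechResponse) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := marshalSpeech(SpeechResponse{Audio: []byte{0xff, 0xfb}, ContentType: "audio/mpeg"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints {"content_type":"audio/mpeg"}
}
```

Without the tag, `encoding/json` would base64-encode the `[]byte` field into the JSON output, which is the kind of accidental serialization the tag is meant to prevent.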

OpenAI transformer

llm/transformer/openai/audio_inbound.go (new): parses JSON (speech) and multipart (STT) requests with fail-fast validation (model, input, voice, file required; invalid temperature / stream rejected; duplicate file parts rejected; client-controlled filenames sanitized). Implements TransformResponse / TransformStream / AggregateStreamChunks.
llm/transformer/openai/audio_outbound.go (new): builds JSON / multipart provider requests, sets the right Accept header (*/* for binary TTS, text/event-stream for SSE), parses provider responses and SSE events.
llm/transformer/openai/outbound.go: TransformRequest / TransformResponse / TransformStream / AggregateStreamChunks dispatch the new audio formats to the dedicated audio code.
Stream-level errors (event: error or {"error":{...}}) flowing in after a 200 response are surfaced as stream errors via parseStreamErrorEvent, instead of being silently decoded as empty audio events.
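To make the stream-level error handling concrete, here is a minimal sketch of the kind of check described above. It is not the PR's `parseStreamErrorEvent`; the function name `detectStreamError`, the payload struct, and the error message format are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamErrorPayload captures the {"error":{...}} envelope a provider may
// send inside an SSE data line after the HTTP 200 has already been written.
type streamErrorPayload struct {
	Error *struct {
		Message string `json:"message"`
		Type    string `json:"type"`
	} `json:"error"`
}

// detectStreamError returns a non-nil error when an SSE event carries a
// provider error, either via "event: error" or via an error envelope.
func detectStreamError(eventType string, data []byte) error {
	var payload streamErrorPayload
	// A decode failure is not fatal here: non-JSON data only counts as an
	// error when the SSE event type says so.
	_ = json.Unmarshal(data, &payload)
	if payload.Error != nil {
		return fmt.Errorf("provider stream error (%s): %s", payload.Error.Type, payload.Error.Message)
	}
	if eventType == "error" {
		return fmt.Errorf("provider stream error event: %s", data)
	}
	return nil
}

func main() {
	ok := detectStreamError("speech.audio.delta", []byte(`{"type":"speech.audio.delta","audio":"AAAA"}`))
	bad := detectStreamError("", []byte(`{"error":{"message":"rate limited","type":"rate_limit_error"}}`))
	fmt.Println(ok, bad)
}
```

Running the check before decoding an event as audio means an error envelope cannot be mistaken for an event with empty audio fields.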

Pipeline

llm/pipeline/empty_response.go: hasResponseContent recognises Speech.Audio, Transcription.Text/Raw, and the two new stream-event carriers so audio responses do not trigger empty-response retries.
llm/httpclient/client.go: Do only sets the default Accept: application/json when the transformer did not already set one (TTS needs */*).
llm/httpclient/utils.go: Accept is added to blockedHeaders so inbound client values cannot override the transformer-owned Accept.
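The two httpclient changes can be sketched together as follows. This is a simplified model under stated assumptions: `mergeInboundHeaders`, `applyDefaultAccept`, and `resolveAccept` are hypothetical helper names, and the `blockedHeaders` map here contains only the entry relevant to this PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// blockedHeaders lists inbound client headers that are never forwarded
// upstream (hypothetical subset of the real list).
var blockedHeaders = map[string]bool{"Accept": true}

// mergeInboundHeaders copies client headers onto the outbound request,
// skipping blocked ones so they cannot override transformer-owned values.
func mergeInboundHeaders(outbound, inbound http.Header) {
	for name, values := range inbound {
		if blockedHeaders[http.CanonicalHeaderKey(name)] {
			continue
		}
		for _, v := range values {
			outbound.Add(name, v)
		}
	}
}

// applyDefaultAccept sets Accept: application/json only when the
// transformer left the header unset.
func applyDefaultAccept(h http.Header) {
	if h.Get("Accept") == "" {
		h.Set("Accept", "application/json")
	}
}

// resolveAccept runs both steps and reports the Accept value sent upstream.
func resolveAccept(transformerAccept, clientAccept string) string {
	outbound := http.Header{}
	if transformerAccept != "" {
		outbound.Set("Accept", transformerAccept)
	}
	inbound := http.Header{}
	if clientAccept != "" {
		inbound.Set("Accept", clientAccept)
	}
	mergeInboundHeaders(outbound, inbound)
	applyDefaultAccept(outbound)
	return outbound.Get("Accept")
}

func main() {
	fmt.Println(resolveAccept("*/*", "application/json")) // binary TTS keeps */*
	fmt.Println(resolveAccept("", "text/html"))           // falls back to application/json
}
```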

Delegating channels (NanoGPT / OpenRouter)

llm/transformer/nanogpt/outbound.go and llm/transformer/openrouter/outbound.go: TransformResponse / TransformStream / AggregateStreamChunks delegate audio requests to the embedded OpenAI outbound so transcript / speech events are not parsed as chat completions.

Orchestrator & persistence

internal/server/api/openai.go and routes.go: three new handlers wired into the existing ChatCompletion orchestrator (reusing pipeline, quota, persistence, retries, etc.).
internal/server/orchestrator/select_endpoints.go: per-request-type allow-lists added for speech / transcription / translation.
internal/server/orchestrator/request.go: persistence is JSON-column-safe. TTS binary audio is replaced with a compact metadata placeholder and offloaded to external DataStorage (mirroring how video artifacts are stored), and non-JSON STT bodies (text / srt / vtt) are wrapped into a JSON object. The body type is determined from Content-Type first; content sniffing is used only when the header is absent.
internal/server/orchestrator/request_execution.go: execution-record format uses the actual outbound request APIFormat instead of the wrapper's primary format, so audio executions are tracked correctly (chat-only APIFormat() would otherwise mis-tag them).
internal/server/orchestrator/pass_through.go: multipart audio formats are excluded from body pass-through because the outbound transformer rebuilds the multipart payload with a new boundary; speech keeps pass-through with model-patch support.
internal/server/orchestrator/inbound.go: isTerminalStreamEvent recognises speech.audio.done and transcript.text.done so audio SSE streams that omit [DONE] still mark the request as completed.
internal/server/biz/request.go: new UpdateRequestCompletedWithAudio and GenerateAudioKey for offloading TTS audio to external storage.
internal/server/biz/channel_endpoint.go and channel_llm.go: audio endpoints are exposed as configurable, plus default endpoints for OpenAI, NanoGPT, and OpenRouter (other OpenAI-compatible channels can opt in via custom endpoints — they are not auto-enabled to avoid mis-routing to providers that may not actually support the audio APIs).
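The JSON-column-safe persistence rule can be sketched as below. This is not the PR's `audioSafeResponseBody`; the function name `audioSafeBody`, the placeholder keys, and the exact media-type checks are assumptions chosen to illustrate "Content-Type first, sniffing only when absent".

```go
package main

import (
	"encoding/json"
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// audioSafeBody returns a body that is always valid JSON: JSON passes
// through, binary audio becomes a metadata placeholder, and other text
// formats (plain text, srt, vtt) are wrapped in a JSON object.
func audioSafeBody(contentType string, body []byte) ([]byte, error) {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil || mediaType == "" {
		// No usable header: fall back to sniffing the payload.
		mediaType, _, _ = mime.ParseMediaType(http.DetectContentType(body))
	}
	switch {
	case mediaType == "application/json" && json.Valid(body):
		return body, nil
	case strings.HasPrefix(mediaType, "audio/") || mediaType == "application/octet-stream":
		// The bytes themselves would go to external storage, not the DB.
		return json.Marshal(map[string]any{
			"type":         "audio_placeholder",
			"content_type": mediaType,
			"size_bytes":   len(body),
		})
	default:
		return json.Marshal(map[string]any{
			"content_type": mediaType,
			"text":         string(body),
		})
	}
}

func main() {
	out, err := audioSafeBody("text/vtt; charset=utf-8", []byte("WEBVTT"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

One limitation of this sketch: a JSON body that arrives without a Content-Type header is sniffed as plain text and wrapped, which is still safe for a JSON column but loses the original structure.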

Frontend

frontend/src/features/channels/data/schema.ts: register the new API formats.
frontend/src/features/models/components/models-association-dialog.tsx: audio formats appear in the association filter options.
frontend/src/features/models/data/providers.json: gpt-4o-mini-tts added.
frontend/src/features/requests/components/request-detail-content.tsx: inline audio playback for completed TTS requests, with a download fallback and a dedicated load-failure state instead of silently showing "no response data".
frontend/src/features/requests/utils/curl-generator.ts: cURL preview emits -F flags for STT (multipart) instead of JSON -d, and drops the logged Content-Type (carries a stale boundary).
i18n strings added for both en and zh-CN.

Tests

llm/transformer/openai/audio_inbound_test.go, audio_outbound_test.go (new): cover JSON / multipart parsing, fail-fast validation (voice, temperature, duplicate file, unsupported stream_format), verbose_json raw passthrough, content-type-aware response wrapping, multipart filename header-injection defence, end-to-end round trip (non-stream + SSE TTS + SSE STT), and stream-level error propagation.
llm/pipeline/empty_response_test.go: new cases for audio response and stream-event content detection.
llm/httpclient/utils_test.go: Accept header is no longer overridden by the inbound request.
internal/server/orchestrator/request_test.go: audioSafeResponseBody matrix and end-to-end TTS persistence test (binary stored externally, JSON column holds metadata).
internal/server/orchestrator/pass_through_test.go: multipart audio is excluded from pass-through; speech pass-through still patches the model.
internal/server/orchestrator/inbound_test.go: terminal-event detection covers speech.audio.done and transcript.text.done.
internal/server/biz/request_audio_test.go (new): TTS audio offloaded to external DataStorage, request row references it via content_storage_*.
internal/server/biz/channel_endpoint_mapping_test.go, channel_llm_*_test.go: default endpoint maps updated to reflect the restricted audio rollout.
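The multipart filename header-injection defence covered by these tests could look roughly like the sketch below. The function name `sanitizeFilename`, the set of stripped characters, and the `audio` fallback are assumptions; the actual rules live in audio_inbound.go.

```go
package main

import (
	"fmt"
	"path"
	"strings"
	"unicode"
)

// sanitizeFilename removes characters that could break out of the
// Content-Disposition header of a rebuilt multipart part, then keeps only
// the final path element of the client-supplied name.
func sanitizeFilename(name string) string {
	var b strings.Builder
	for _, r := range name {
		// Drop CR, LF, and other control characters, plus double quotes.
		if unicode.IsControl(r) || r == '"' {
			continue
		}
		b.WriteRune(r)
	}
	// Treat Windows separators like slashes before taking the base name.
	cleaned := strings.ReplaceAll(b.String(), "\\", "/")
	base := path.Base(cleaned)
	if base == "." || base == ".." || base == "/" {
		return "audio" // hypothetical fallback for empty or path-only names
	}
	return base
}

func main() {
	fmt.Println(sanitizeFilename("clip.mp3\r\nX-Evil: 1"))
	fmt.Println(sanitizeFilename("../../etc/passwd"))
}
```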

Compatibility

Existing chat / completions / embeddings / image / video routes are unchanged.
Other OpenAI-compatible channels (Vercel, DeepInfra, PPIO, SiliconFlow, AtlasCloud, Aihubmix, Burncloud, GitHub Models) do not receive audio default endpoints automatically; they keep the previous six-endpoint default and can opt in via custom endpoints.
Accept header semantics change for HTTP requests where the outbound transformer set it explicitly — the previous behaviour unconditionally overwrote it with application/json. All in-tree transformers already set their intended Accept explicitly, so the user-visible behaviour is unchanged outside of audio.

Out of scope

TTS default binary chunked stream (stream_format not set): requires a new raw-byte response carrier in the pipeline and the WriteSSEStream writer; tracked as a follow-up. Clients can use stream_format: "sse" for streaming TTS today, or fall back to the non-streaming path.

greptile-apps · 2026-06-04T14:03:01Z

Greptile Summary

This PR wires three new OpenAI-compatible audio endpoints (/v1/audio/speech, /v1/audio/transcriptions, /v1/audio/translations) into the existing pipeline, reusing quota, retry, and persistence infrastructure. SSE streaming is supported for both TTS (stream_format: "sse") and STT (stream: true), while raw binary chunked TTS streaming is explicitly deferred.

Backend: new inbound/outbound transformers with fail-fast validation, multipart parsing, content-type-aware response wrapping, and a dedicated persistence path that offloads binary TTS audio to external storage (mirroring video artifacts) and JSON-wraps non-JSON STT bodies before DB persistence.
Pipeline changes: Accept header ownership is moved to the outbound transformer, audio formats are excluded from body pass-through, and isTerminalStreamEvent is extended to recognise speech.audio.done and transcript.text.done.
Frontend: inline <audio> player with object-URL lifecycle management, a download fallback, and cURL preview updated to emit -F flags for multipart STT requests.
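A minimal sketch of the terminal-event check mentioned in the pipeline changes, assuming each SSE data payload is either the literal `[DONE]` sentinel or a JSON object with a `type` field. The function name `isTerminalEvent` is hypothetical; the two event names are the ones listed in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isTerminalEvent reports whether an SSE data payload ends the stream,
// either via the [DONE] sentinel or via an audio "done" event.
func isTerminalEvent(data string) bool {
	data = strings.TrimSpace(data)
	if data == "[DONE]" {
		return true
	}
	var ev struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal([]byte(data), &ev); err != nil {
		return false
	}
	switch ev.Type {
	case "speech.audio.done", "transcript.text.done":
		return true
	}
	return false
}

func main() {
	fmt.Println(isTerminalEvent(`{"type":"speech.audio.done"}`))  // true
	fmt.Println(isTerminalEvent(`{"type":"speech.audio.delta"}`)) // false
}
```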

Confidence Score: 5/5

Safe to merge; the new audio endpoints are additive and existing chat/embedding/image/video paths are not modified in behaviour.

The implementation is thorough: binary TTS bodies are intercepted before they can corrupt the JSON column, non-JSON STT responses are wrapped, Accept-header ownership is correctly moved to the outbound transformer, and multipart formats are excluded from body pass-through. The only finding is a missing log warning when TTS audio is dropped because no external storage is configured.

Files worth a closer look: internal/server/biz/request.go (UpdateRequestCompletedWithAudio silently drops audio when no external storage is configured); llm/transformer/openai/audio_inbound.go (text-field size limit flagged in a prior thread).

Important Files Changed

Filename	Overview
llm/transformer/openai/audio_inbound.go	New inbound transformer for TTS/STT with fail-fast validation, multipart parsing, SSE aggregation, and filename sanitization; text fields use the audio file size limit (26 MB) which was flagged in a previous thread.
llm/transformer/openai/audio_outbound.go	New outbound transformer builds JSON (TTS) and multipart (STT) provider requests, dispatches SSE stream decoding to dedicated parsers, and correctly sets Accept headers for binary vs event-stream responses.
internal/server/biz/request.go	Adds UpdateRequestCompletedWithAudio and GenerateAudioKey; correctly uses xjson.Marshal (which passes []byte through as raw JSON), but silently drops TTS audio bytes without a warning log when no non-primary external storage is configured.
internal/server/orchestrator/request.go	Adds audioSafeResponseBody and audioFilenameForContentType; correctly routes TTS to UpdateRequestCompletedWithAudio and wraps non-JSON STT bodies before DB persistence.
internal/server/orchestrator/request_execution.go	Execution records now use the actual outbound APIFormat instead of the wrapper primary and correctly sanitize binary TTS bodies via audioSafeResponseBody before DB persistence.
internal/server/orchestrator/pass_through.go	Correctly excludes multipart audio formats from body pass-through and adds Speech (JSON body) to the model-patch allow-list.
internal/server/orchestrator/inbound.go	Extends isTerminalStreamEvent to recognise speech.audio.done and transcript.text.done so audio SSE streams without [DONE] are marked completed.
frontend/src/features/requests/components/request-detail-content.tsx	Adds inline audio playback with object-URL lifecycle management and a download fallback; fetchStoredContent dependency array and shared isLoadingAudio state were flagged in prior threads.
llm/httpclient/client.go	Conditionally sets Accept: application/json only when not already set by the transformer, allowing TTS to use */* for binary audio responses.
llm/httpclient/utils.go	Adds Accept to blockedHeaders so inbound client values cannot override the transformer-set Accept header.
internal/server/biz/channel_endpoint.go	Introduces openAIFullDefaultEndpoints with audio formats; correctly restricts audio defaults to confirmed-compatible channel types (OpenAI, NanoGPT, OpenRouter) and leaves other compatible channels opt-in only.
internal/server/orchestrator/select_endpoints.go	Adds per-request-type allow-lists for speech, transcription, and translation, preventing audio requests from being routed to chat-only endpoints.
llm/audio.go	Clean unified model definitions for TTS/STT request/response/stream-event types; binary fields tagged json:"-" prevent accidental JSON serialization.
llm/transformer/openai/outbound.go	Correctly dispatches TransformRequest/TransformResponse/TransformStream/AggregateStreamChunks to audio handlers for the three new formats; non-audio paths unchanged.

Reviews (2): Last reviewed commit: "feat: support OpenAI audio SSE streaming (TTS stream_for..."

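Add three OpenAI-compatible audio endpoints:
- POST /v1/audio/speech: text-to-speech (TTS), binary audio response
- POST /v1/audio/transcriptions: speech transcription (STT), multipart upload
- POST /v1/audio/translations: speech translation (STT), multipart upload

Backend:
- llm/audio.go: unified audio request/response models; the Extra field passes through unmodeled parameters
- openai audio inbound/outbound transformer: multipart parsing and rebuilding, filename sanitization against header injection, lossless passthrough of raw verbose_json fields
- Empty-response detection supports Speech/Transcription content checks
- TTS binary audio is persisted to external storage (content_storage_* fields) and the DB stores only a metadata placeholder; non-JSON STT responses are wrapped as JSON before being stored
- The execution-record format prefers the APIFormat of the actual outbound request
- The Accept header is owned by the outbound transformer; inbound requests cannot override it
- Body pass-through is disabled for multipart audio (boundary mismatch)
- Default audio endpoints are granted only to OpenAI/NanoGPT/OpenRouter; other compatible channels enable them explicitly via custom endpoints

Frontend:
- Audio preview/download on the request detail page
- STT cURL generation uses the -F multipart format
- Channel endpoint and model association dialogs support audio formats
- i18n (en/zh-CN)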

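Add SSE streaming support; the binary chunked stream form is out of scope for this change.
- llm/audio.go: add the SpeechStreamEvent / TranscriptionStreamEvent models; the Speech request gains a StreamFormat field
- llm/model.go: Response gains stream-event carrier fields
- llm/pipeline: empty-response detection recognises stream-event content
- openai outbound: TransformStream/AggregateStreamChunks dispatch by APIFormat to audio-specific SSE decoding and aggregation
- openai inbound: parse stream/stream_format, reject any stream_format other than sse, and implement TransformStream/AggregateStreamChunks; the STT stream field is consumed by the gateway and is no longer passed through to Extra, where it would be sent twice
- multipart requests append stream=true and Accept: text/event-stream as needed
- nanogpt/openrouter: audio requests are delegated to the underlying OpenAI TransformStream and AggregateStreamChunks so they are not parsed as chat completions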
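For the STT side, chunk aggregation might be sketched as follows. This is not the PR's AggregateStreamChunks; the function name `aggregateTranscript` and the `transcriptEvent` struct are hypothetical, and the sketch assumes `transcript.text.delta` events carry a `delta` field and `transcript.text.done` carries the full `text`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// transcriptEvent holds the fields this sketch reads from an STT SSE payload.
type transcriptEvent struct {
	Type  string `json:"type"`
	Delta string `json:"delta"`
	Text  string `json:"text"`
}

// aggregateTranscript folds SSE data payloads into one final transcript.
func aggregateTranscript(chunks []string) (string, error) {
	var sb strings.Builder
	for _, raw := range chunks {
		if strings.TrimSpace(raw) == "[DONE]" {
			break
		}
		var ev transcriptEvent
		if err := json.Unmarshal([]byte(raw), &ev); err != nil {
			return "", fmt.Errorf("decode stream chunk: %w", err)
		}
		switch ev.Type {
		case "transcript.text.delta":
			sb.WriteString(ev.Delta)
		case "transcript.text.done":
			// Prefer the provider's final text when it is present.
			if ev.Text != "" {
				return ev.Text, nil
			}
			return sb.String(), nil
		}
	}
	return sb.String(), nil
}

func main() {
	text, err := aggregateTranscript([]string{
		`{"type":"transcript.text.delta","delta":"Hello "}`,
		`{"type":"transcript.text.delta","delta":"world"}`,
		`{"type":"transcript.text.done","text":"Hello world"}`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```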

EmccK · 2026-06-04T14:41:15Z

Hi @looplj, fixed the lint failures and addressed the Greptile review feedback (filename fallback in fetchStoredContent). Could you please re-approve the workflow runs when you have a moment? Thanks!

* feat: support OpenAI audio APIs (TTS/STT)
* feat: support OpenAI audio SSE streaming (TTS stream_format=sse / STT stream=true)

greptile-apps Bot reviewed Jun 4, 2026

Comment thread llm/transformer/openai/audio_inbound.go

Comment thread frontend/src/features/requests/components/request-detail-content.tsx

Comment thread llm/transformer/openai/audio_inbound.go

Comment thread frontend/src/features/requests/components/request-detail-content.tsx

EmccK force-pushed the feat/openai-audio branch from 42e0974 to 36fef6a on June 4, 2026 14:09

EmccK added 2 commits June 4, 2026 22:10

EmccK force-pushed the feat/openai-audio branch from 36fef6a to 5243a1f on June 4, 2026 14:10

looplj merged commit ad096f5 into looplj:unstable Jun 4, 2026
4 checks passed

EmccK deleted the feat/openai-audio branch June 6, 2026 08:45

greptile-apps Bot mentioned this pull request Jun 6, 2026

fix(audio): fix TTS streaming Accept, external storage, and persistence memory usage #1792

Merged
