feat: add OpenAI audio APIs (TTS / STT) with SSE streaming #1784
Greptile Summary
This PR wires three new OpenAI-compatible audio endpoints (
Confidence Score: 5/5
Safe to merge; the new audio endpoints are additive, and the behaviour of the existing chat/embedding/image/video paths is not modified.
The implementation is thorough: binary TTS bodies are intercepted before they can corrupt the JSON column, non-JSON STT responses are wrapped, `Accept`-header ownership is correctly moved to the outbound transformer, and multipart formats are excluded from body pass-through. The only finding is a missing log warning when TTS audio is dropped because no external storage is configured.
`internal/server/biz/request.go` (`UpdateRequestCompletedWithAudio` silent audio drop path); `llm/transformer/openai/audio_inbound.go` (text-field size limit flagged in prior thread).
Reviews (2): Last reviewed commit: "feat: support OpenAI audio SSE streaming (TTS stream_for..."
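Add three OpenAI-compatible audio endpoints:
- POST /v1/audio/speech: text-to-speech (TTS); the response is binary audio
- POST /v1/audio/transcriptions: speech transcription (STT), multipart upload
- POST /v1/audio/translations: speech translation (STT), multipart upload

Backend:
- llm/audio.go: unified audio request/response models; the Extra field passes through parameters that are not modeled
- openai audio inbound/outbound transformers: multipart parsing and rebuilding, filename sanitization to prevent header injection, lossless passthrough of the raw verbose_json fields
- Empty-response detection now checks Speech/Transcription content
- TTS binary audio is persisted to external storage (content_storage_* fields) and the DB stores only a metadata placeholder; non-JSON STT responses are wrapped as JSON before being stored
- The execution record's format prefers the APIFormat of the actual outbound request
- The Accept header is owned by the outbound transformer; inbound requests cannot override it
- Body pass-through is disabled for multipart audio (the boundary would not match)
- Default audio endpoints are granted only to OpenAI/NanoGPT/OpenRouter; other compatible channels enable them explicitly through custom endpoints

Frontend:
- Audio preview/download on the request detail page
- STT cURL generation uses the -F multipart format
- The channel endpoint and model association dialogs support the audio formats
- i18n (en/zh-CN)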
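Add SSE streaming support; the binary chunked stream form is out of scope for this change.
- llm/audio.go: add the SpeechStreamEvent / TranscriptionStreamEvent models; the Speech request gains a StreamFormat field
- llm/model.go: Response gains fields that carry stream events
- llm/pipeline: empty-response detection recognizes stream-event content
- openai outbound: TransformStream/AggregateStreamChunks dispatch by APIFormat to audio-specific SSE decoding and aggregation
- openai inbound: parse stream/stream_format, reject any stream_format other than sse, and implement TransformStream/AggregateStreamChunks; the STT stream field is consumed by the gateway and is no longer passed through in Extra, where it would be sent twice
- multipart requests append stream=true and Accept: text/event-stream when needed
- nanogpt/openrouter: audio requests delegate to the underlying OpenAI TransformStream and AggregateStreamChunks so they are not parsed as chat completions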
Hi @looplj, fixed the lint failures and addressed the Greptile review feedback (filename fallback in `fetchStoredContent`). Could you please re-approve the workflow runs when you have a moment? Thanks!
* feat: support OpenAI audio APIs (TTS/STT)
* feat: support OpenAI audio SSE streaming (TTS stream_format=sse / STT stream=true)
Closes #1743
Summary
This PR adds OpenAI-compatible audio API endpoints to AxonHub:
- `POST /v1/audio/speech`: text-to-speech (TTS), binary audio response
- `POST /v1/audio/transcriptions`: speech-to-text (STT), multipart upload
- `POST /v1/audio/translations`: speech-to-text translation, multipart upload

It also supports SSE streaming for both directions:
- TTS: `stream_format: "sse"` (`gpt-4o-mini-tts`)
- STT: `stream: true` (`gpt-4o-transcribe`)

The binary chunked TTS stream (the OpenAI default without `stream_format`) is intentionally out of scope: it would require a third response carrier (raw byte stream) on top of the existing JSON / SSE pipeline, and most clients already accept SSE streaming.

Changes
Backend
Unified models
- `llm/audio.go` (new): `SpeechRequest`, `TranscriptionRequest`, `TranslationRequest`, plus their non-streaming responses and `SpeechStreamEvent` / `TranscriptionStreamEvent` for SSE.
- `llm/model.go`: `Request` and `Response` carry the audio sub-structs and stream events alongside the existing `Embedding` / `Image` / `Video` fields.
- `llm/constants.go`: three new request types and three new API formats (`openai/audio_speech`, `openai/audio_transcriptions`, `openai/audio_translations`).

OpenAI transformer
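One item in this section is that client-controlled multipart filenames are sanitized to defend against header injection. The following is a minimal sketch of that idea, not the PR's actual code: the function name `sanitizeFilename`, the exact character set that is stripped, and the `audio` fallback name are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sanitizeFilename is a hypothetical helper (not the PR's actual function).
// It reduces a client-supplied filename to a single safe path element so it
// can be embedded in a multipart Content-Disposition header.
func sanitizeFilename(name string) string {
	// Treat backslashes as separators so Windows-style paths are stripped too.
	name = strings.ReplaceAll(name, "\\", "/")
	name = path.Base(name)

	// Drop control characters (including CR/LF) and double quotes, which
	// could otherwise terminate the header value or start a new header.
	name = strings.Map(func(r rune) rune {
		if r < 0x20 || r == 0x7f || r == '"' {
			return -1
		}
		return r
	}, name)
	name = strings.TrimSpace(name)

	switch name {
	case "", ".", "..", "/":
		return "audio" // assumed fallback name
	}
	return name
}

func main() {
	fmt.Printf("%q\n", sanitizeFilename("clip.wav\r\nX-Injected: 1"))
	fmt.Printf("%q\n", sanitizeFilename(`C:\tmp\clip.mp3`))
	fmt.Printf("%q\n", sanitizeFilename(""))
}
```

The key property is that the result never contains CR, LF, or a double quote, so it cannot break out of the quoted `filename` parameter when the outbound transformer rebuilds the multipart body.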
- `llm/transformer/openai/audio_inbound.go` (new): parses JSON (speech) and multipart (STT) requests with fail-fast validation (`model`, `input`, `voice`, `file` required; invalid `temperature` / `stream` rejected; duplicate file parts rejected; client-controlled filenames sanitized). Implements `TransformResponse` / `TransformStream` / `AggregateStreamChunks`.
- `llm/transformer/openai/audio_outbound.go` (new): builds JSON / multipart provider requests, sets the right `Accept` header (`*/*` for binary TTS, `text/event-stream` for SSE), and parses provider responses and SSE events.
- `llm/transformer/openai/outbound.go`: `TransformRequest` / `TransformResponse` / `TransformStream` / `AggregateStreamChunks` dispatch the new audio formats to the dedicated audio code.
- Stream-level errors (`event: error` or `{"error":{...}}`) flowing in after a 200 response are surfaced as stream errors via `parseStreamErrorEvent`, instead of being silently decoded as empty audio events.

Pipeline
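The `Accept`-header ownership rule in this section (the transformer's value wins, inbound client values are blocked, and `application/json` is only a default) can be sketched as follows. This is an illustration under assumptions: the helper names `mergeInboundHeaders`, `applyDefaultAccept`, and `resolveAccept` are hypothetical, and the real `blockedHeaders` set presumably contains more entries than the single one shown here.

```go
package main

import (
	"fmt"
	"net/http"
)

// blockedHeaders lists inbound headers that must never be copied to the
// outbound request. Only Accept is taken from the PR text; the real set is
// assumed to be larger.
var blockedHeaders = map[string]bool{"Accept": true}

// mergeInboundHeaders copies client headers onto the outbound request unless
// the header is blocked or the transformer already set it.
func mergeInboundHeaders(out, in http.Header) {
	for key, values := range in {
		ck := http.CanonicalHeaderKey(key)
		if blockedHeaders[ck] || out.Get(ck) != "" {
			continue
		}
		for _, v := range values {
			out.Add(ck, v)
		}
	}
}

// applyDefaultAccept falls back to application/json only when the
// transformer left Accept unset.
func applyDefaultAccept(h http.Header) {
	if h.Get("Accept") == "" {
		h.Set("Accept", "application/json")
	}
}

// resolveAccept runs both steps and returns the Accept value that would be
// sent to the provider.
func resolveAccept(transformerAccept, clientAccept string) string {
	out, in := http.Header{}, http.Header{}
	if transformerAccept != "" {
		out.Set("Accept", transformerAccept)
	}
	if clientAccept != "" {
		in.Set("Accept", clientAccept)
	}
	mergeInboundHeaders(out, in)
	applyDefaultAccept(out)
	return out.Get("Accept")
}

func main() {
	fmt.Println(resolveAccept("*/*", "application/json")) // binary TTS keeps */*
	fmt.Println(resolveAccept("", "text/html"))           // client value is ignored
}
```

With this shape, a binary TTS request keeps `*/*` even if the client sent `Accept: application/json`, while requests whose transformer sets nothing still get the JSON default.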
- `llm/pipeline/empty_response.go`: `hasResponseContent` recognises `Speech.Audio`, `Transcription.Text` / `Raw`, and the two new stream-event carriers so audio responses do not trigger empty-response retries.
- `llm/httpclient/client.go`: `Do` only sets the default `Accept: application/json` when the transformer did not already set one (TTS needs `*/*`).
- `llm/httpclient/utils.go`: `Accept` is added to `blockedHeaders` so inbound client values cannot override the transformer-owned `Accept`.

Delegating channels (NanoGPT / OpenRouter)
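The delegation pattern in this section can be illustrated with a toy model. It is only a sketch: the `DecodeEvent` method, the type names, and the chat format string are hypothetical stand-ins, and only the three audio format strings are taken from this description.

```go
package main

import "fmt"

const (
	formatChat           = "openai/chat_completions" // assumed name, for contrast only
	formatSpeech         = "openai/audio_speech"
	formatTranscriptions = "openai/audio_transcriptions"
	formatTranslations   = "openai/audio_translations"
)

// isAudioFormat reports whether a request uses one of the audio API formats.
func isAudioFormat(format string) bool {
	switch format {
	case formatSpeech, formatTranscriptions, formatTranslations:
		return true
	}
	return false
}

// openAIOutbound stands in for the OpenAI outbound transformer, which knows
// how to decode both chat and audio stream events.
type openAIOutbound struct{}

func (openAIOutbound) DecodeEvent(format, data string) string {
	if isAudioFormat(format) {
		return "audio:" + data
	}
	return "chat:" + data
}

// channelOutbound stands in for a NanoGPT/OpenRouter-style transformer that
// embeds the OpenAI outbound and customises chat handling.
type channelOutbound struct {
	openAIOutbound
}

func (c channelOutbound) DecodeEvent(format, data string) string {
	if isAudioFormat(format) {
		// Delegate audio so it is never parsed as a chat completion.
		return c.openAIOutbound.DecodeEvent(format, data)
	}
	return "channel-chat:" + data
}

func main() {
	c := channelOutbound{}
	fmt.Println(c.DecodeEvent(formatSpeech, "x"))
	fmt.Println(c.DecodeEvent(formatChat, "x"))
}
```

The check on the request format happens before any channel-specific parsing, which is what keeps transcript and speech events away from the chat-completion decoder.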
- `llm/transformer/nanogpt/outbound.go` and `llm/transformer/openrouter/outbound.go`: `TransformResponse` / `TransformStream` / `AggregateStreamChunks` delegate audio requests to the embedded OpenAI outbound so transcript / speech events are not parsed as chat completions.

Orchestrator & persistence
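The JSON-column-safe persistence rule in this section (JSON stored as-is, text bodies wrapped, binary audio replaced by a metadata placeholder, `Content-Type` consulted before sniffing) might look roughly like the sketch below. The function name `audioSafeBody`, the placeholder and wrapper field names, and the sniffing heuristic are assumptions for illustration; the PR's `audioSafeResponseBody` may differ in all of them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"mime"
	"strings"
	"unicode/utf8"
)

// audioPlaceholder is what the JSON column would hold instead of raw audio.
// Field names are assumptions.
type audioPlaceholder struct {
	ContentType string `json:"content_type"`
	SizeBytes   int    `json:"size_bytes"`
}

// wrappedText wraps a non-JSON STT body (text/srt/vtt). Field names are assumptions.
type wrappedText struct {
	Format string `json:"format"`
	Text   string `json:"text"`
}

// sniff is a simple fallback used only when Content-Type is missing or invalid.
func sniff(body []byte) string {
	switch {
	case json.Valid(body):
		return "application/json"
	case utf8.Valid(body):
		return "text/plain"
	default:
		return "application/octet-stream"
	}
}

// audioSafeBody returns a JSON document that is safe to store in a JSON column.
func audioSafeBody(contentType string, body []byte) string {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		mediaType = sniff(body) // header absent or malformed
	}
	switch {
	case mediaType == "application/json" && json.Valid(body):
		return string(body)
	case strings.HasPrefix(mediaType, "text/"):
		out, _ := json.Marshal(wrappedText{Format: mediaType, Text: string(body)})
		return string(out)
	default:
		// Binary audio: the bytes themselves would go to external storage.
		out, _ := json.Marshal(audioPlaceholder{ContentType: mediaType, SizeBytes: len(body)})
		return string(out)
	}
}

func main() {
	fmt.Println(audioSafeBody("audio/mpeg", []byte{0xff, 0xfb, 0x90}))
	fmt.Println(audioSafeBody("text/vtt", []byte("WEBVTT\n")))
	fmt.Println(audioSafeBody("", []byte(`{"text":"hi"}`)))
}
```

Checking the declared media type first avoids misclassifying short transcripts, and the sniffing branch only runs when the provider sent no usable `Content-Type`.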
- `internal/server/api/openai.go` and `routes.go`: three new handlers wired into the existing `ChatCompletion` orchestrator (reusing pipeline, quota, persistence, retries, etc.).
- `internal/server/orchestrator/select_endpoints.go`: per-request-type allow-lists added for speech / transcription / translation.
- `internal/server/orchestrator/request.go`: persistence is JSON-column-safe. TTS binary audio is replaced with a compact metadata placeholder and offloaded to external `DataStorage` (mirroring how video artifacts are stored), and non-JSON STT bodies (text/srt/vtt) are wrapped into a JSON object. The body type is determined by `Content-Type` first, with sniffing only when the header is absent.
- `internal/server/orchestrator/request_execution.go`: the execution-record `format` uses the actual outbound request `APIFormat` instead of the wrapper's primary format, so audio executions are tracked correctly (the chat-only `APIFormat()` would otherwise mis-tag them).
- `internal/server/orchestrator/pass_through.go`: multipart audio formats are excluded from body pass-through because the outbound transformer rebuilds the multipart payload with a new boundary; speech keeps pass-through with model-patch support.
- `internal/server/orchestrator/inbound.go`: `isTerminalStreamEvent` recognises `speech.audio.done` and `transcript.text.done` so audio SSE streams that omit `[DONE]` still mark the request as completed.
- `internal/server/biz/request.go`: new `UpdateRequestCompletedWithAudio` and `GenerateAudioKey` for offloading TTS audio to external storage.
- `internal/server/biz/channel_endpoint.go` and `channel_llm.go`: audio endpoints are exposed as configurable, plus default endpoints for `OpenAI`, `NanoGPT`, and `OpenRouter` (other OpenAI-compatible channels can opt in via custom endpoints; they are not auto-enabled, to avoid mis-routing to providers that may not actually support the audio APIs).

Frontend
- `frontend/src/features/channels/data/schema.ts`: register the new API formats.
- `frontend/src/features/models/components/models-association-dialog.tsx`: audio formats appear in the association filter options.
- `frontend/src/features/models/data/providers.json`: `gpt-4o-mini-tts` added.
- `frontend/src/features/requests/components/request-detail-content.tsx`: inline audio playback for completed TTS requests, with a download fallback and a dedicated load-failure state instead of silently showing "no response data".
- `frontend/src/features/requests/utils/curl-generator.ts`: the cURL preview emits `-F` flags for STT (multipart) instead of JSON `-d`, and drops the logged `Content-Type` (which carries a stale boundary).

Tests
- `llm/transformer/openai/audio_inbound_test.go`, `audio_outbound_test.go` (new): cover JSON / multipart parsing, fail-fast validation (`voice`, `temperature`, duplicate `file`, unsupported `stream_format`), `verbose_json` raw passthrough, content-type-aware response wrapping, multipart filename header-injection defence, end-to-end round trip (non-stream + SSE TTS + SSE STT), and stream-level error propagation.
- `llm/pipeline/empty_response_test.go`: new cases for audio response and stream-event content detection.
- `llm/httpclient/utils_test.go`: the `Accept` header is no longer overridden by the inbound request.
- `internal/server/orchestrator/request_test.go`: `audioSafeResponseBody` matrix and end-to-end TTS persistence test (binary stored externally, JSON column holds metadata).
- `internal/server/orchestrator/pass_through_test.go`: multipart audio is excluded from pass-through; speech pass-through still patches the model.
- `internal/server/orchestrator/inbound_test.go`: terminal-event detection covers `speech.audio.done` and `transcript.text.done`.
- `internal/server/biz/request_audio_test.go` (new): TTS audio is offloaded to external `DataStorage`, and the request row references it via `content_storage_*`.
- `internal/server/biz/channel_endpoint_mapping_test.go`, `channel_llm_*_test.go`: default endpoint maps updated to reflect the restricted audio rollout.

Compatibility
- `Accept` header semantics change for HTTP requests where the outbound transformer set it explicitly: the previous behaviour unconditionally overwrote it with `application/json`. All in-tree transformers already set their intended `Accept` explicitly, so the user-visible behaviour is unchanged outside of audio.

Out of scope
- Binary chunked TTS stream (`stream_format` not set): requires a new raw-byte response carrier in the pipeline and the `WriteSSEStream` writer; tracked as a follow-up. Clients can use `stream_format: "sse"` for streaming TTS today, or fall back to the non-streaming path.
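For clients that take the `stream_format: "sse"` route, consuming the stream amounts to reading `data:` lines until a terminal marker arrives. The sketch below shows one way to do that on already-received SSE text. It is not part of this PR: `collectSpeechAudio` is a hypothetical helper, and the `speech.audio.delta` event type with a base64 `audio` field is an assumption about the provider payload (only `speech.audio.done` and `[DONE]` are named in this description).

```go
package main

import (
	"bufio"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// speechEvent models only the fields this sketch needs. The delta type and
// the base64 "audio" field are assumptions about the provider payload.
type speechEvent struct {
	Type  string `json:"type"`
	Audio string `json:"audio"`
}

// collectSpeechAudio reads SSE text, concatenates decoded audio deltas, and
// reports whether a terminal marker (speech.audio.done or [DONE]) was seen.
func collectSpeechAudio(stream string) ([]byte, bool) {
	scanner := bufio.NewScanner(strings.NewReader(stream))
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long base64 lines

	var audio []byte
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data:") {
			continue // skip blank separators, comments, and event: lines
		}
		data := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if data == "[DONE]" {
			return audio, true
		}
		var ev speechEvent
		if err := json.Unmarshal([]byte(data), &ev); err != nil {
			continue // ignore payloads this sketch does not understand
		}
		switch ev.Type {
		case "speech.audio.delta":
			if chunk, err := base64.StdEncoding.DecodeString(ev.Audio); err == nil {
				audio = append(audio, chunk...)
			}
		case "speech.audio.done":
			return audio, true
		}
	}
	return audio, false // stream ended without a terminal marker
}

func main() {
	stream := "data: {\"type\":\"speech.audio.delta\",\"audio\":\"aGk=\"}\n\n" +
		"data: {\"type\":\"speech.audio.done\"}\n\n"
	audio, done := collectSpeechAudio(stream)
	fmt.Printf("%q %v\n", audio, done)
}
```

Treating `speech.audio.done` as terminal in addition to `[DONE]` mirrors the gateway-side rule described for `isTerminalStreamEvent`, so a stream that omits `[DONE]` is still recognised as complete.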