
fix(audio): Fix TTS streaming Accept, external storage, and persistence memory usage #1792

Merged
looplj merged 3 commits into looplj:unstable from EmccK:feat/openai-audio-next
Jun 7, 2026



EmccK commented Jun 6, 2026

Contributor

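Background

After #1784 introduced the OpenAI Audio API, /audio/speech defaulted to the stream_format=audio streaming path, and the Accept handling in DoStream was relaxed. Together these caused several regressions, which this PR fixes.

Fixes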

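1. DoStream no longer negotiated SSE correctly

llm/httpclient/client.go had been changed to write text/event-stream only when Accept was empty. The OpenAI chat outbound, however, sets Accept: application/json by default, so ordinary streaming chat requests declared JSON rather than SSE to the upstream.

Fix: force Accept to text/event-stream whenever it is empty or equal to application/json, while leaving */* (used by binary TTS streams) and outbounds that already explicitly set text/event-stream untouched.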
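The rule above can be sketched as a small pure function. This is only an illustration under assumptions: streamAccept is a hypothetical helper name, the case-insensitive comparison is a guess at how equality is checked, and the URL is a placeholder. The actual check lives inside DoStream.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// streamAccept returns the Accept value a streaming request should carry.
// Hypothetical helper: the name and exact matching rules are assumptions.
func streamAccept(current string) string {
	current = strings.TrimSpace(current)
	// Empty or plain JSON means the outbound did not ask for a stream format,
	// so SSE is forced.
	if current == "" || strings.EqualFold(current, "application/json") {
		return "text/event-stream"
	}
	// Anything else (for example */* for binary TTS) is left untouched.
	return current
}

func main() {
	// Placeholder upstream URL; no request is actually sent.
	req, err := http.NewRequest(http.MethodPost, "https://upstream.invalid/v1/chat/completions", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept", "application/json")
	req.Header.Set("Accept", streamAccept(req.Header.Get("Accept")))
	fmt.Println(req.Header.Get("Accept"))
}
```

Keeping the decision in a function of the current header value makes the three cases (empty, JSON, explicit) easy to cover in a table-driven test.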

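2. /audio/speech streamed by default, bypassing external audio storage

When stream_format was omitted, llm/transformer/openai/audio_inbound.go used to force Stream=true and stream_format=audio. That bypassed the external audio storage path in UpdateRequestCompletedWithAudio and left only an audio_bytes summary, so the full audio could not be downloaded.

Fix: an omitted stream_format now stays non-streaming (Stream=false); a client must explicitly pass sse or audio to enter the streaming pipeline. The parameter is optional in the official OpenAI API anyway.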
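A minimal sketch of the resulting mapping from stream_format to a delivery mode. The type, constant, and function names here are hypothetical, and rejecting unknown values with an error is an assumption; the PR does not say how unsupported values are handled.

```go
package main

import "fmt"

// speechMode is a hypothetical enum for how a /audio/speech response is delivered.
type speechMode int

const (
	modeNonStreaming speechMode = iota // full body, eligible for external audio storage
	modeSSE                            // stream_format=sse
	modeBinary                         // stream_format=audio
)

// resolveSpeechMode maps the optional stream_format parameter to a mode.
// An omitted value stays non-streaming. Returning an error for unknown
// values is an assumption made for this sketch.
func resolveSpeechMode(streamFormat string) (speechMode, error) {
	switch streamFormat {
	case "":
		return modeNonStreaming, nil
	case "sse":
		return modeSSE, nil
	case "audio":
		return modeBinary, nil
	default:
		return modeNonStreaming, fmt.Errorf("unsupported stream_format %q", streamFormat)
	}
}

func main() {
	for _, f := range []string{"", "sse", "audio"} {
		mode, err := resolveSpeechMode(f)
		fmt.Printf("stream_format=%q mode=%d err=%v\n", f, mode, err)
	}
}
```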

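3. Empty binary TTS streams were mistaken for completed ones

transformSpeechBinaryChunk converts binary.done into a newly constructed *llm.Response{Object:"[DONE]"}, but checkEmptyResponse only recognized the pointer comparison event == llm.DoneResponse. As a result, a 200 response with an empty body did not trigger the empty-response retry, and a completed request with audio_bytes:0 was saved.

Fix: empty-response detection now also recognizes event.Object == "[DONE]", and hasResponseContent is updated to match.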
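The pitfall is ordinary Go pointer identity: two pointers to structs with equal fields are still different pointers. The sketch below uses a stand-in Response type rather than the real llm.Response, and isDone is a hypothetical helper name.

```go
package main

import "fmt"

// Response is a stand-in for llm.Response with only the field relevant here.
type Response struct {
	Object string
}

// DoneResponse plays the role of the shared llm.DoneResponse sentinel.
var DoneResponse = &Response{Object: "[DONE]"}

// isDone accepts both the shared sentinel pointer and any freshly
// constructed response whose Object is "[DONE]".
func isDone(r *Response) bool {
	if r == nil {
		return false
	}
	return r == DoneResponse || r.Object == "[DONE]"
}

func main() {
	fresh := &Response{Object: "[DONE]"}
	fmt.Println(fresh == DoneResponse) // false: different allocation, same contents
	fmt.Println(isDone(fresh))         // true: matched by Object value
}
```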

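4. A TTS SSE stream with only done and no delta did not trigger a retry either

hasResponseContent checked SpeechStreamEvent.Type != "", which treated a bare speech.audio.done as valid content.

Fix: only AudioBase64 != "" counts as content; a bare speech.audio.done no longer does.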
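A sketch of the tightened content check. The struct mirrors only the two fields named above, and both function names are hypothetical; the real hasResponseContent handles other response kinds as well.

```go
package main

import "fmt"

// SpeechStreamEvent is a stand-in carrying only the fields discussed above.
type SpeechStreamEvent struct {
	Type        string
	AudioBase64 string
}

// hasSpeechContent treats an event as content only when it carries audio.
// Checking Type != "" instead would wrongly accept a bare speech.audio.done.
func hasSpeechContent(e SpeechStreamEvent) bool {
	return e.AudioBase64 != ""
}

// streamIsEmpty reports whether no event in the stream carried audio,
// which is the condition that should lead to an empty-response retry.
func streamIsEmpty(events []SpeechStreamEvent) bool {
	for _, e := range events {
		if hasSpeechContent(e) {
			return false
		}
	}
	return true
}

func main() {
	onlyDone := []SpeechStreamEvent{{Type: "speech.audio.done"}}
	withAudio := []SpeechStreamEvent{
		{Type: "speech.audio.delta", AudioBase64: "AAAA"},
		{Type: "speech.audio.done"},
	}
	fmt.Println(streamIsEmpty(onlyDone), streamIsEmpty(withAudio))
}
```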

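5. Streamed audio was retained in full in the persistence buffers

Both the main persistence stream and the LivePreview buffer appended the raw audio/mpeg chunks to their buffers, so a long audio stream accumulated several complete copies of the audio bytes before the request finished.

Fix:

  • httpclient.StreamEvent gains a Size field
  • New SummarizeBinaryChunk(): summarizes a binary audio chunk (clears Data, keeps Type and the byte count)
  • InboundPersistentStream.Current, OutboundPersistentStream.Current, liveRequestStream.Next, and liveRequestExecutionStream.Next all summarize chunks before writing them to the persistence/preview buffers; downstream consumers still receive the original events
  • aggregateSpeechStreamChunks and marshalStreamEventForStorage fall back to chunk.Size when counting bytes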
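The summarize-then-buffer idea can be sketched as follows. The StreamEvent here is a simplified stand-in, the field types are assumptions, and treating any "audio/" type as a binary audio chunk is a guess at what IsBinaryAudioChunk checks; chunkBytes is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// StreamEvent is a simplified stand-in for httpclient.StreamEvent.
type StreamEvent struct {
	Type string
	Data []byte
	Size int // byte count kept after Data has been dropped
}

// IsBinaryAudioChunk is assumed here to mean "Type is an audio MIME type".
func (e *StreamEvent) IsBinaryAudioChunk() bool {
	return e != nil && strings.HasPrefix(e.Type, "audio/")
}

// SummarizeBinaryChunk returns a copy without Data but with the byte count,
// leaving the original event intact for downstream consumers.
func SummarizeBinaryChunk(e *StreamEvent) *StreamEvent {
	if !e.IsBinaryAudioChunk() {
		return e
	}
	size := len(e.Data)
	if size == 0 {
		size = e.Size // already summarized: keep the recorded count
	}
	return &StreamEvent{Type: e.Type, Size: size}
}

// chunkBytes counts bytes for aggregation, falling back to Size when
// Data has been elided by summarization.
func chunkBytes(e *StreamEvent) int {
	if len(e.Data) > 0 {
		return len(e.Data)
	}
	return e.Size
}

func main() {
	original := &StreamEvent{Type: "audio/mpeg", Data: make([]byte, 4096)}
	stored := SummarizeBinaryChunk(original)
	fmt.Println(len(original.Data), len(stored.Data), chunkBytes(stored))
}
```

Returning a new event instead of mutating the input is what lets the persistence buffer hold the summary while the client still receives the full bytes.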

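Tests

  • Added/updated unit tests covering all five fixes (empty binary stream, empty SSE done, Accept negotiation, external storage path, Size fallback)
  • Manually verified locally that audio is generated correctly on three paths: tts-1 non-streaming, and gpt-4o-mini-tts with sse and with audio
  • go test ./... passes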

EmccK added 2 commits June 6, 2026 22:24
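- DoStream forces the SSE Accept header so that streaming chat/completions is not overridden to application/json
- /audio/speech stays non-streaming when stream_format is omitted, restoring the external audio storage path
- Empty-response detection recognizes Object="[DONE]" and a bare speech.audio.done, triggering a retry
- The persistence buffer and the LivePreview buffer summarize binary audio chunks to avoid caching the full audio bytes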

greptile-apps Bot commented Jun 6, 2026

Contributor

Greptile Summary

This PR fixes five regressions introduced by #1784's OpenAI Audio API integration: broken SSE Accept-header negotiation, default-streaming bypassing external audio storage, empty binary TTS streams not triggering retries, bare speech.audio.done events counted as content, and full audio payloads accumulating in persistence buffers.

  • Accept header fix (DoStream): forces text/event-stream when the outbound Accept is empty or application/json, preserving explicit */* for binary TTS.
  • Default streaming removed (audio_inbound): omitting stream_format now stays non-streaming so UpdateRequestCompletedWithAudio can persist the full audio file; only explicit "sse" or "audio" engages streaming.
  • Memory fix (SummarizeBinaryChunk): persistence/live-preview buffers store a size-only summary instead of the raw audio bytes, with chunk.Size used as fallback in aggregators when Data has been elided.
  • Empty-response detection (checkEmptyResponse + hasResponseContent): recognises freshly-constructed {Object:"[DONE]"} sentinels and restricts SpeechStreamEvent content check to AudioBase64 != "".

Confidence Score: 5/5

Safe to merge. All five bug fixes have targeted, well-isolated changes and the new binary streaming path is covered end-to-end by integration tests.

Each regression is addressed with the minimal necessary change: the Accept-header fix, the default-streaming removal, the binary decoder registration, the memory summarisation at every persistence boundary, and the empty-response sentinel recognition. The test suite covers all four empty-response scenarios and the full binary streaming round-trip. No correctness issues were found during review.

No files require special attention. The most complex new code is binaryChunkDecoder and SummarizeBinaryChunk, both of which are fully unit-tested.

Important Files Changed

| Filename | Overview |
| --- | --- |
| llm/httpclient/model.go | Adds Size field, IsBinaryAudioChunk() method, and SummarizeBinaryChunk() to StreamEvent; introduces BinaryStreamDoneEventType constant. Implementation is clean and well-tested. |
| llm/httpclient/decoder.go | Adds binaryChunkDecoder for non-SSE audio streams, with correct EOF/error handling, atomic close, and buffer reuse. Registers decoders for all common audio MIME types. |
| llm/httpclient/client.go | Fixes Accept header negotiation in DoStream: forces text/event-stream when Accept is empty or application/json, preserving explicit values like `*/*` for binary TTS. Adds MIME parameter stripping for content-type decoder lookup. |
| llm/pipeline/stream.go | Extends empty-response check to recognize freshly-constructed {Object:"[DONE]"} terminators in addition to the shared llm.DoneResponse sentinel, fixing TTS binary stream empty-retry detection. |
| llm/pipeline/empty_response.go | hasResponseContent now recognizes SpeechAudioChunk and properly rejects bare speech.audio.done events (no audio) as non-content. Also handles {Object:"[DONE]"} response objects. |
| llm/transformer/openai/audio_inbound.go | Default stream_format now stays non-streaming; only explicit "sse" or "audio" engages the streaming pipeline. TransformStream correctly routes binary vs SSE done events and aggregateSpeechStreamChunks uses chunk.Size as fallback for summarized audio chunks. |
| llm/transformer/openai/audio_outbound.go | Adds transformSpeechBinaryChunk and speechStreamChunkTransformFor to handle raw binary provider streams alongside SSE. Binary done sentinel correctly emits an {Object:"[DONE]"} response with RequestType/APIFormat set for downstream routing. |
| internal/server/orchestrator/inbound.go | InboundPersistentStream.Current() now appends summarized binary chunks to responseChunks while returning the original full-data event to consumers. BinaryStreamDoneEventType added to isTerminalStreamEvent. |
| internal/server/orchestrator/outbound.go | OutboundPersistentStream.Current() summarizes binary chunks for persistence. isCompletedAggregated now uses meta.Completed flag, enabling TTS aggregation completion without requiring completion tokens. |
| internal/server/biz/request.go | Adds marshalStreamEventForStorage and shouldSkipStoredStreamChunk to unify chunk serialization, correctly using chunk.Size fallback when binary audio Data has been elided by summarization. |
| internal/server/api/chat.go | ChatCompletion refactored to delegate to new ChatCompletionWithRequest. Adds WriteBinaryStream for raw chunked audio responses and streamErrorStatus for mapping errors to HTTP status codes. |
| internal/server/api/openai.go | CreateSpeech now routes to WriteBinaryStream for stream_format=audio and to the default SSE/non-streaming path otherwise, based on shouldUseBinarySpeechStream routing function. |
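To make the binary decoding path more concrete, here is a rough sketch of a chunk decoder that turns a raw audio body into typed events followed by a terminal marker. It is not the PR's binaryChunkDecoder: the type and function names are hypothetical, the "binary.done" value is inferred from the description above, and close handling and buffer pooling are left out.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// doneEventType is assumed to match the value behind BinaryStreamDoneEventType.
const doneEventType = "binary.done"

// Event is a simplified stand-in for a decoded stream event.
type Event struct {
	Type string
	Data []byte
}

// chunkDecoder reads a raw body in fixed-size pieces (hypothetical type).
type chunkDecoder struct {
	r           io.Reader
	contentType string
	buf         []byte
	pending     error // error returned together with the last non-empty read
	finished    bool
}

func newChunkDecoder(r io.Reader, contentType string, chunkSize int) *chunkDecoder {
	if chunkSize <= 0 {
		chunkSize = 4096 // avoid zero-length reads that never make progress
	}
	return &chunkDecoder{r: r, contentType: contentType, buf: make([]byte, chunkSize)}
}

// Next returns audio chunks, then exactly one done event, then io.EOF.
func (d *chunkDecoder) Next() (Event, error) {
	if d.finished {
		return Event{}, io.EOF
	}
	for {
		err := d.pending
		d.pending = nil
		n := 0
		if err == nil {
			n, err = d.r.Read(d.buf)
		}
		if n > 0 {
			d.pending = err // deliver the data first, handle the error on the next call
			// Copy out of the reused buffer so callers can keep the chunk.
			return Event{Type: d.contentType, Data: append([]byte(nil), d.buf[:n]...)}, nil
		}
		if err == io.EOF {
			d.finished = true
			return Event{Type: doneEventType}, nil
		}
		if err != nil {
			d.finished = true
			return Event{}, err
		}
	}
}

// eventTypes is a demo helper that lists "type:size" for every decoded event.
func eventTypes(body string, chunkSize int) []string {
	d := newChunkDecoder(strings.NewReader(body), "audio/mpeg", chunkSize)
	var out []string
	for {
		ev, err := d.Next()
		if err != nil {
			return out
		}
		out = append(out, fmt.Sprintf("%s:%d", ev.Type, len(ev.Data)))
	}
}

func main() {
	fmt.Println(eventTypes("abcdefgh", 3))
	fmt.Println(eventTypes("", 3)) // empty body yields only the done event
}
```

An empty body producing nothing but the done event is exactly the case the empty-response check needs to catch.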

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Client -->|POST /audio/speech| CreateSpeech

    CreateSpeech -->|stream_format absent| NonStream[Non-streaming path\nChatCompletionWithRequest]
    CreateSpeech -->|stream_format=sse| SSEPath[SSE path\nWriteSSEStream]
    CreateSpeech -->|stream_format=audio| BinaryPath[Binary path\nWriteBinaryStream]

    NonStream -->|Do HTTP| ProviderHTTP[Provider HTTP response\nbinary audio body]
    ProviderHTTP -->|UpdateRequestCompletedWithAudio| ExternalStorage[(External Storage)]

    SSEPath -->|DoStream Accept: text/event-stream| ProviderSSE[Provider SSE stream]
    ProviderSSE -->|SSE decoder| SSEDecoder[speech.audio.delta / done events]
    SSEDecoder -->|transformSpeechStreamChunk| LLMResponseSSE[llm.Response SpeechStreamEvent]

    BinaryPath -->|DoStream Accept: */*| ProviderBinary[Provider chunked audio]
    ProviderBinary -->|binaryChunkDecoder| BinaryDecoder[audio/mpeg chunks + binary.done]
    BinaryDecoder -->|transformSpeechBinaryChunk| LLMResponseBin[llm.Response SpeechAudioChunk / DONE]

    LLMResponseSSE --> CheckEmpty{checkEmptyResponse}
    LLMResponseBin --> CheckEmpty

    CheckEmpty -->|has content| InboundTransform[inbound.TransformStream]
    CheckEmpty -->|DONE with no content| Retry[ErrEmptyResponse - retry]

    InboundTransform -->|SpeechAudioChunk| StreamEventBin[StreamEvent Type=audio/mpeg Data=bytes]
    InboundTransform -->|SpeechStreamEvent| StreamEventSSE[StreamEvent Type=speech.audio.delta]
    InboundTransform -->|binary DONE| StreamEventDone[StreamEvent Type=binary.done]

    StreamEventBin -->|InboundPersistentStream| SummarizePersist[SummarizeBinaryChunk\nSize only stored]
    SummarizePersist --> DB[(DB chunks\naudio_bytes count)]
    StreamEventBin -->|WriteBinaryStream| ClientResponse[Binary audio response]

    StreamEventDone -->|WriteBinaryStream skip| ClientResponse

Reviews (2): Last reviewed commit: "fix: address TTS streaming review comments"

Comment thread llm/httpclient/decoder.go
Comment thread llm/httpclient/decoder.go Outdated
Comment thread internal/server/api/openai.go
@looplj looplj merged commit 7891a8d into looplj:unstable Jun 7, 2026
4 checks passed
junjiangao pushed a commit to junjiangao/axonhub that referenced this pull request Jun 9, 2026
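* fix(audio): fix TTS streaming Accept, external storage, and persistence memory usage

- DoStream forces the SSE Accept header so that streaming chat/completions is not overridden to application/json
- /audio/speech stays non-streaming when stream_format is omitted, restoring the external audio storage path
- Empty-response detection recognizes Object="[DONE]" and a bare speech.audio.done, triggering a retry
- The persistence buffer and the LivePreview buffer summarize binary audio chunks to avoid caching the full audio bytes

* fix(lint): add a blank line after the embedded field in chat_test

* fix: address TTS streaming review comments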