[Feature] Add chunk-based streaming ASR for Qwen3-ASR#22089
mickqian merged 5 commits into sgl-project:main
Conversation
The serving_transcription is somewhat hardcoded for Whisper; we could refactor it a bit.
JustinTong0323 left a comment
PR Review Summary
Thanks for the clean implementation! The server-side chunked streaming approach is well-designed and minimal. Below are some issues found during review, organized by severity.
Critical (3): StopAsyncIteration escape, prompt template duplication, missing input validation
Important (6): prefix text source, language=None, CJK word-splitting, silent None returns, formula duplication, rsplit edge case
Suggestions (4): state invariants, dangerous default, silent fallback, docstring typo
```python
chunk_size_sec: float = 2.0   # Paper: "2-second chunk size"
unfixed_chunk_num: int = 2    # qwen3_asr.py default: 2, paper: 4
unfixed_token_num: int = 5    # Paper: "5-token fallback"
```
Are these numbers pre-selected for qwen3-asr only? Could we expose it for user to change?
Right, these are the Qwen3-ASR defaults from the paper. The dataclass already supports overrides via its constructor, so adding different defaults for new ASR models should be straightforward. Exposing them as user-facing API parameters can be a follow-up PR.
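As a concrete illustration of the constructor override mentioned above, a minimal sketch (the class name `ChunkedStreamingConfig` is an assumption; the field names and defaults come from the diff):

```python
from dataclasses import dataclass

@dataclass
class ChunkedStreamingConfig:
    # Field names and defaults from the diff above; the class name is assumed.
    chunk_size_sec: float = 2.0    # Paper: "2-second chunk size"
    unfixed_chunk_num: int = 2     # qwen3_asr.py default: 2, paper: 4
    unfixed_token_num: int = 5     # Paper: "5-token fallback"

# Qwen3-ASR defaults from the paper
default_cfg = ChunkedStreamingConfig()

# A new ASR model can ship different defaults by overriding via the constructor
other_cfg = ChunkedStreamingConfig(chunk_size_sec=4.0, unfixed_chunk_num=4)
```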
@JustinTong0323 Thanks for the review and detailed feedback. All review feedback has been addressed: critical and most important issues are fixed. Some items (hardcoded or hasattr, PretrainedConfig fallback, rsplit edge case, silent None returns) are from #22073 and left for @adityavaid for now, but I'm happy to fix them in this PR if you prefer. Refactoring (strategy pattern, CJK rollback, API parameter exposure) is planned as follow-ups. Ready for re-review.
LGTM for the modification of encode_server
Add chunked streaming transcription for ASR models (e.g. Qwen3-ASR). Audio is processed in configurable chunks with prefix rollback to reduce boundary jitter, emitting partial transcripts via SSE.
- Add StreamingASRState for prefix rollback state management
- Add split_audio_chunks utility for cumulative audio chunking
- Extend TranscriptionAdapter with supports_chunked_streaming, prompt_template, and chunked_streaming_config
- Route chunked streaming via adapter pattern (no model-specific logic in serving layer)
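The cumulative chunking utility named in the commit message could look like this (a sketch; the real `split_audio_chunks` signature in the PR may differ):

```python
import math

def split_audio_chunks(num_samples, sample_rate, chunk_size_sec=2.0):
    """Return cumulative (start, end) sample ranges: every chunk re-includes
    all audio from the start, matching the cumulative-chunking scheme
    described in the commit message. The signature is an assumption."""
    chunk_samples = int(chunk_size_sec * sample_rate)
    n_chunks = math.ceil(num_samples / chunk_samples)
    return [(0, min((i + 1) * chunk_samples, num_samples)) for i in range(n_chunks)]

# 10 s of 16 kHz audio with 2 s chunks: end points at 2 s, 4 s, ..., 10 s
chunks = split_audio_chunks(160_000, 16_000)
```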
Hi @AgainstEntropy, done. I'll add ASR streaming documentation shortly and create a separate follow-up PR for streaming input via WebSocket tonight, with a design document, sample inputs, and tests. While this PR focuses on server-side chunked streaming of complete audio uploads, the new PR will build on top of this chunked streaming implementation.
/tag-and-rerun-ci
mickqian left a comment
Consider adding a bench serving test as a follow-up. Cheers.
Address review feedback: import _DEFAULT_ASR_PROMPT from multimodal/processors/qwen3_asr.py instead of duplicating it.
Make the prompt template a public constant since it is imported by the transcription adapter.
Bug: Streaming output drops spaces at chunk boundaries. The streaming SSE deltas, concatenated, differ from the offline result.

Root cause in:

```python
for j, word in enumerate(delta.split(" ")):
    if not word:
        continue
    content = word if j == 0 else " " + word
```

Repro:

```shell
wget -O /tmp/test.flac "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
python3 -m sglang.launch_server --model-path Qwen/Qwen3-ASR-0.6B --port 38080 --disable-cuda-graph
# Compare:
curl -s localhost:38080/v1/audio/transcriptions -F file=@/tmp/test.flac -F model=Qwen/Qwen3-ASR-0.6B -F response_format=text
curl -s localhost:38080/v1/audio/transcriptions -F file=@/tmp/test.flac -F model=Qwen/Qwen3-ASR-0.6B -F stream=true | grep -oP '"content":"[^"]*"' | sed 's/"content":"//;s/"$//' | tr -d '\n'
```
The first word of each chunk delta was missing a leading space, causing words at chunk boundaries to concatenate without separator (e.g. "hopedthere" instead of "hoped there"). Track a first_word flag across all chunks so only the very first word of the entire stream omits the leading space.
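The fix can be sketched as follows (function and variable names are assumptions; only the `first_word` flag logic reflects the description above):

```python
def stream_deltas(chunk_deltas):
    """Emit stream content pieces so that only the very first word of the
    entire stream omits its leading space; chunk-initial words keep it.
    Sketch of the fix described above; names are assumptions."""
    first_word = True  # persists across all chunks, not reset per chunk
    for delta in chunk_deltas:
        for word in delta.split(" "):
            if not word:
                continue
            yield word if first_word else " " + word
            first_word = False
```

With the per-chunk `j == 0` check, the first word of every chunk lost its separator; with a stream-wide flag, words at chunk boundaries stay separated.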
JustinTong0323 left a comment
Tested streaming ASR on Qwen3-ASR-0.6B. Space-dropping bug at chunk boundaries is fixed. Streaming vs offline diff is now only punctuation (expected). LGTM.
/tag-and-rerun-ci
Motivation
Issue: #22025 (streaming input design and implementation)
This PR adds chunk-based streaming transcription support for Qwen3-ASR via the existing `/v1/audio/transcriptions` endpoint with `stream=true`. Audio is processed in 2-second chunks with prefix rollback, emitting partial transcripts via SSE as each chunk completes, reducing time-to-first-text compared to waiting for the full audio to be processed.

Built on top of #22073 (Qwen3-ASR model support by @adityavaid). This PR is branched from #22073 and should be merged after it. My changes are `streaming_asr.py` (new), streaming additions in `serving_transcription.py`, and a config fix in `hf_transformers_utils.py`.
References

`qwen_asr/inference/qwen3_asr.py`

Approach
Based on the Qwen3-ASR streaming algorithm (arXiv:2601.21337):
This is a server-side chunked streaming approach. vLLM is exploring a complementary client-side approach (vllm-project/vllm#35908) where callers control chunking via a `prefix` parameter. Both use the same underlying algorithm.

New files
Modified files
- `serving_transcription.py`: new `_generate_chunked_asr_stream` method. Route `stream=true` requests through the chunked ASR pipeline when the model family is `qwen3_asr`. Detect client disconnection between chunks via `raw_request.is_disconnected()`.
- `hf_transformers_utils.py`: add `Qwen3ASRConfig` to the `_CONFIG_REGISTRY` import and registry list (fix for #22073; without this, `AutoConfig.from_pretrained` fails with `KeyError: 'qwen3_asr'`).

These are consistent with vLLM's findings (vllm-project/vllm#35767): their Qwen3-ASR realtime endpoint also produces degraded output with stateless segments and no cross-segment context. Quality improves with larger `chunk_size_sec` (e.g. 2s → 4s, fewer boundaries) at the cost of higher first-text latency.

Tested Cases
Test audio: https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac (11s English speech)
Derived clips generated via:
- `test_zh.flac` (8s, gTTS)
- `StreamingASRState` docstring

Architecture: Why Each Chunk Is an Independent Request
Each chunk is processed as a separate SGLang inference request. This has two consequences:
Repeated encoding: Each chunk re-encodes all accumulated audio from the start. For 10s audio with 2s chunks, total encoder work is 2+4+6+8+10 = 30s instead of 10s (~3x overhead). Acceptable for small models (0.6B encoder < 100ms per chunk) but becomes a bottleneck for long audio.
No shared state across chunks: No KV cache reuse between chunks, no encoder output caching. SGLang's scheduler treats each request independently — no concept of a "streaming session" that persists across requests.
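The repeated-encoding overhead above can be computed directly; a quick illustration of the arithmetic (the helper name is hypothetical):

```python
def cumulative_encoder_seconds(audio_sec, chunk_sec):
    """Total seconds of audio the encoder processes when every chunk
    re-encodes all accumulated audio from the start (cumulative chunking)."""
    ends, t = [], chunk_sec
    while t < audio_sec:
        ends.append(t)
        t += chunk_sec
    ends.append(audio_sec)  # final (possibly partial) chunk covers everything
    return sum(ends)

# 10 s audio, 2 s chunks: 2 + 4 + 6 + 8 + 10 = 30 s of encoder work (~3x)
total = cumulative_encoder_seconds(10, 2)
```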
This is the MVP architecture choice: operate purely at the serving layer, no scheduler or model code changes. Keeps the change minimal and reviewable.
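The serving-layer-only design can be sketched as a simple loop; `transcribe` and the parameter names are assumptions, and prefix rollback is omitted for brevity. Only the per-chunk independent request and the `raw_request.is_disconnected()` check reflect the PR description:

```python
async def generate_chunked_asr_stream(audio, chunk_ends, raw_request, transcribe):
    """Sketch: each chunk is an independent inference request over all
    audio seen so far; only the newly produced suffix is emitted via SSE.
    `transcribe` is a stand-in for a full SGLang request."""
    prefix = ""
    for end in chunk_ends:
        # Detect client disconnection between chunks (per the PR description)
        if await raw_request.is_disconnected():
            break
        # Re-encodes all accumulated audio from the start (cumulative chunking)
        text = await transcribe(audio[:end], prefix)
        yield text[len(prefix):]
        prefix = text
```

Because every chunk is a fresh request, the scheduler needs no notion of a streaming session; that is what keeps the change confined to the serving layer.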
Known Limitations (planned follow-ups)
- Encoder output caching (`get_audio_feature`)
- `stream=True` per chunk for smoother output (the current implementation emits word-by-word after each chunk completes)
- `prefix` API parameter: add an optional `prefix` parameter to `/v1/audio/transcriptions` for client-side streaming control (as discussed in [RFC]: Model-specific realtime streaming abstraction, vllm-project/vllm#35908)

Checklist