
[Feature] Add chunk-based streaming ASR for Qwen3-ASR#22089

Merged
mickqian merged 5 commits into sgl-project:main from SammLSH:feat/qwen3-asr-streaming
Apr 9, 2026
Conversation

Contributor

@SammLSH SammLSH commented Apr 4, 2026

Motivation

Issue: #22025 (streaming input design and implementation)

This PR adds chunk-based streaming transcription support for Qwen3-ASR via the existing /v1/audio/transcriptions endpoint with stream=true. Audio is processed in 2-second chunks with prefix rollback, emitting partial transcripts via SSE as each chunk completes — reducing time-to-first-text compared to waiting for the full audio to be processed.

Built on top of #22073 (Qwen3-ASR model support by @adityavaid). This PR is branched from #22073 and should be merged after it. My changes are streaming_asr.py (new), streaming additions in serving_transcription.py, and a config fix in hf_transformers_utils.py.

Approach

Based on the Qwen3-ASR streaming algorithm (arXiv:2601.21337):

  • Audio is split into 2-second accumulated chunks (each chunk contains all audio from 0 to current position)
  • Each chunk is processed as an independent transcription request
  • Previous transcript is fed as decoder prefix with rollback (last 5 words dropped) to reduce boundary jitter
  • First 2 chunks are "cold start" — no prefix is injected (model doesn't have enough context yet)
  • Stable text deltas are emitted via SSE word-by-word for smooth output
  • Client disconnection is detected between chunks to avoid wasted computation
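The prefix-rollback rule in the steps above can be sketched in a few lines. This is illustrative only — the function name and constants mirror the description, not the actual streaming_asr.py API:

```python
# Illustrative sketch of the prefix-rollback scheme described above;
# names and defaults are assumptions, not the real streaming_asr.py code.
COLD_START_CHUNKS = 2  # first chunks get no prefix (not enough context yet)
ROLLBACK_WORDS = 5     # drop the last 5 words to reduce boundary jitter


def prefix_for_chunk(chunk_index: int, previous_transcript: str) -> str:
    """Return the decoder prefix to inject when transcribing this chunk."""
    if chunk_index < COLD_START_CHUNKS:
        return ""  # cold start: inject nothing
    words = previous_transcript.split()
    return " ".join(words[:-ROLLBACK_WORDS])  # roll back the unstable tail


print(prefix_for_chunk(0, "he hoped there would be stew"))  # "" (cold start)
print(prefix_for_chunk(3, "he hoped there would be stew for dinner"))  # "he hoped there"
```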

This is a server-side chunked streaming approach. vLLM is exploring a complementary client-side approach (vllm-project/vllm#35908) where callers control chunking via a prefix parameter. Both use the same underlying algorithm.

New files

File Purpose
python/sglang/srt/entrypoints/openai/streaming_asr.py StreamingASRState (prefix rollback state management with configurable chunk size, rollback window, cold start period), split_audio_chunks (split audio into accumulated chunks), build_streaming_prompt (construct chat-template prompt with optional prefix text)
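A minimal sketch of the accumulated-chunk splitting (hypothetical signature; the real split_audio_chunks operates on decoded audio arrays and may differ):

```python
# Sketch of accumulated chunking: chunk i contains ALL samples from 0 to
# the i-th 2-second boundary, with the final chunk covering the full audio.
# Signature is an assumption, not the actual streaming_asr.py API.
def split_audio_chunks(samples: list[float], sample_rate: int,
                       chunk_size_sec: float = 2.0) -> list[list[float]]:
    step = int(sample_rate * chunk_size_sec)
    # Boundaries at every full chunk, plus the end of the audio.
    ends = list(range(step, len(samples), step)) + [len(samples)]
    return [samples[:end] for end in ends]


chunks = split_audio_chunks([0.0] * (16000 * 5), 16000)  # 5 s at 16 kHz
print([len(c) / 16000 for c in chunks])  # [2.0, 4.0, 5.0]
```

Note that a 1.5 s input yields a single chunk and an exact 2.0 s input also yields a single chunk, matching the edge cases in the test table below.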

Modified files

File What changed
serving_transcription.py Add _generate_chunked_asr_stream method. Route stream=true requests through chunked ASR pipeline when model family is qwen3_asr. Detect client disconnection between chunks via raw_request.is_disconnected().
hf_transformers_utils.py Add Qwen3ASRConfig to _CONFIG_REGISTRY import and registry list (fix for #22073 — without this, AutoConfig.from_pretrained fails with KeyError: 'qwen3_asr').

This design is consistent with vLLM's findings (vllm-project/vllm#35767): their Qwen3-ASR realtime endpoint also produces degraded output with stateless segments and no cross-segment context. Quality improves with a larger chunk_size_sec (e.g. 2s → 4s, fewer boundaries) at the cost of higher first-text latency.

Tested Cases

Test audio: https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac (11s English speech)

Derived clips generated via:

import soundfile as sf
data, sr = sf.read('/tmp/test_speech.flac')
sf.write('/tmp/short.flac', data[:int(sr*1.5)], sr)   # 1.5s, 1 chunk
sf.write('/tmp/exact.flac', data[:int(sr*2.0)], sr)   # 2.0s, exact boundary
sf.write('/tmp/mid.flac',   data[:int(sr*5.0)], sr)   # 5s, 3 chunks

Chinese test audio (gTTS)

from gtts import gTTS
import librosa, soundfile as sf
gTTS('今天天气真好,我想去公园散步,顺便买一些新鲜的水果回家做饭', lang='zh').save('/tmp/test_zh.mp3')
data, sr = librosa.load('/tmp/test_zh.mp3', sr=16000)
sf.write('/tmp/test_zh.flac', data, sr)

Case Input Result
Offline test_speech.flac (11s) ✅ Full accurate transcription
Streaming, 11s test_speech.flac (6 chunks) ✅ Word-by-word, near-identical to offline (only — → : punctuation diff)
Streaming, 1.5s short.flac (1 chunk) ✅ finalize() single-chunk path works
Streaming, 2.0s exact.flac (chunk boundary) ✅ Edge case handled
Streaming, 5s mid.flac (3 chunks) ✅ Multi-chunk path works
Empty audio touch empty.flac ✅ 400: audio_data is empty
Corrupted audio echo "garbage" > garbage.flac ✅ 400: Format not recognised
response_format=text test_speech.flac ✅ Plain text response
response_format=verbose_json test_speech.flac ✅ JSON with language/duration
Streaming, Chinese audio test_zh.flac (8s, gTTS) ✅ Output correct, but full text returned in single delta (CJK rollback ineffective, known limitation — see StreamingASRState docstring)

Architecture: Why Each Chunk Is an Independent Request

Each chunk is processed as a separate SGLang inference request. This has two consequences:

  1. Repeated encoding: Each chunk re-encodes all accumulated audio from the start. For 10s audio with 2s chunks, total encoder work is 2+4+6+8+10 = 30s instead of 10s (~3x overhead). Acceptable for small models (0.6B encoder < 100ms per chunk) but becomes a bottleneck for long audio.

  2. No shared state across chunks: No KV cache reuse between chunks, no encoder output caching. SGLang's scheduler treats each request independently — no concept of a "streaming session" that persists across requests.
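The cumulative-encoding overhead from point 1 can be checked with pure arithmetic (no SGLang dependency; the function name is illustrative):

```python
# Total seconds of audio pushed through the encoder across all accumulated
# chunks: each chunk re-encodes everything from t=0 up to its boundary.
def encoder_work_sec(audio_sec: float, chunk_sec: float = 2.0) -> float:
    n_full = int(audio_sec // chunk_sec)
    ends = [chunk_sec * (i + 1) for i in range(n_full)]
    if not ends or ends[-1] < audio_sec:
        ends.append(audio_sec)  # final partial chunk covers the full audio
    return sum(ends)


print(encoder_work_sec(10.0))       # 2+4+6+8+10 = 30.0
print(encoder_work_sec(10.0) / 10)  # ~3x overhead vs. one-shot encoding
```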

This is the MVP architecture choice: operate purely at the serving layer, no scheduler or model code changes. Keeps the change minimal and reviewable.

Known Limitations (planned follow-ups)

  • Encoder window caching: Cache completed 100-frame encoder windows across chunks to avoid re-encoding (requires model-layer changes to get_audio_feature)
  • Token-level prefix: Current prefix is text-level (string concatenation); token-level prefix injection would be more precise
  • Cross-chunk KV cache reuse: Requires scheduler to support persistent streaming sessions
  • True real-time input: WebSocket endpoint for live audio streaming instead of chunked processing of complete audio
  • Token-level output within chunks: Use stream=True per chunk for smoother output (current implementation emits word-by-word after each chunk completes)
  • prefix API parameter: Add optional prefix parameter to /v1/audio/transcriptions for client-side streaming control (as discussed in [RFC]: Model-specific realtime streaming abstraction vllm-project/vllm#35908)

Checklist

  • Format your code with pre-commit.
  • Run and add unit tests.
  • Update the documentation.
  • Follow the SGLang code style guidance.

@JustinTong0323 JustinTong0323 self-assigned this Apr 4, 2026
@JustinTong0323
Collaborator

serving_transcription.py is somewhat hardcoded for Whisper; we could refactor it a bit.


@JustinTong0323 JustinTong0323 left a comment


PR Review Summary

Thanks for the clean implementation! The server-side chunked streaming approach is well-designed and minimal. Below are some issues found during review, organized by severity.

Critical (3): StopAsyncIteration escape, prompt template duplication, missing input validation
Important (6): prefix text source, language=None, CJK word-splitting, silent None returns, formula duplication, rsplit edge case
Suggestions (4): state invariants, dangerous default, silent fallback, docstring typo

Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment thread python/sglang/srt/configs/qwen3_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py
Comment thread python/sglang/srt/configs/model_config.py
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment on lines +16 to +18
chunk_size_sec: float = 2.0 # Paper: "2-second chunk size"
unfixed_chunk_num: int = 2 # qwen3_asr.py default: 2 paper: 4
unfixed_token_num: int = 5 # Paper: "5-token fallback"
Collaborator


Are these numbers pre-selected for qwen3-asr only? Could we expose it for user to change?

Contributor Author

@SammLSH SammLSH Apr 4, 2026


Right, these are Qwen3-ASR default settings from the paper. The dataclass already supports overrides via its constructor, so adding different defaults for new ASR models should be straightforward. Exposing them as user-facing API parameters can be a follow-up PR.
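For instance, a dataclass like this supports per-field overrides out of the box (the class name here is illustrative; the field names mirror the snippet above):

```python
from dataclasses import dataclass


@dataclass
class ChunkedStreamingConfig:  # hypothetical name for the config dataclass
    chunk_size_sec: float = 2.0   # paper: 2-second chunk size
    unfixed_chunk_num: int = 2    # cold-start chunks without prefix
    unfixed_token_num: int = 5    # rollback window ("5-token fallback")


cfg = ChunkedStreamingConfig()                      # Qwen3-ASR defaults
cfg_4s = ChunkedStreamingConfig(chunk_size_sec=4.0,  # a new model could
                                unfixed_chunk_num=4)  # override per-field
print(cfg_4s.chunk_size_sec)  # 4.0
```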

@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch from 42c93e5 to 1de73b9 Compare April 4, 2026 18:35
Contributor Author

SammLSH commented Apr 4, 2026

@JustinTong0323 Thanks for the review and detailed feedback. All review feedback has been addressed: the Critical and most of the Important issues are fixed. Some items (hardcoded `or hasattr`, PretrainedConfig fallback, rsplit edge case, silent None returns) come from #22073 and are left for @adityavaid for now, but I'm happy to fix them in this PR if you prefer. Refactoring (strategy pattern, CJK rollback, API parameter exposure) is planned as follow-ups. Ready for re-review.

@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch from cea7314 to cac24fd Compare April 4, 2026 20:02
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from 168d7da to a27564e Compare April 6, 2026 21:46
@github-actions github-actions Bot added documentation Improvements or additions to documentation Multi-modal multi-modal language model labels Apr 6, 2026
Collaborator

@ShangmingCai ShangmingCai Apr 7, 2026


LGTM for the modification of encode_server

@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from e4e5388 to f7a2504 Compare April 7, 2026 07:45
@SammLSH SammLSH requested a review from JustinTong0323 April 7, 2026 07:47
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 3 times, most recently from bde404a to 5c45046 Compare April 7, 2026 08:18
@AgainstEntropy
Collaborator

Hi @SammLSH, can you rebase onto main and adjust accordingly, since #22181 was merged?
Also, can you add some documentation for ASR streaming input?

@AgainstEntropy AgainstEntropy self-requested a review April 8, 2026 20:08
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from 101484f to 969d984 Compare April 8, 2026 22:15
Add chunked streaming transcription for ASR models (e.g. Qwen3-ASR).
Audio is processed in configurable chunks with prefix rollback to
reduce boundary jitter, emitting partial transcripts via SSE.

- Add StreamingASRState for prefix rollback state management
- Add split_audio_chunks utility for cumulative audio chunking
- Extend TranscriptionAdapter with supports_chunked_streaming,
  prompt_template, and chunked_streaming_config
- Route chunked streaming via adapter pattern (no model-specific
  logic in serving layer)
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch from 969d984 to 918e08d Compare April 8, 2026 23:19
Contributor Author

SammLSH commented Apr 8, 2026

Hi @AgainstEntropy, done. I'll add ASR streaming documentation shortly and create a separate follow-up PR tonight for streaming input via WebSocket, with a design document, sample inputs, and tests. While this PR focuses on server-side chunked streaming of complete audio uploads, the new PR will build on top of this chunked streaming implementation.

Collaborator

mickqian commented Apr 9, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 9, 2026

@mickqian mickqian left a comment


Consider adding a bench serving test as a follow-up. Cheers.

Comment thread python/sglang/srt/entrypoints/openai/transcription_adapters/qwen3_asr.py Outdated
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from 2a7a15e to 918e08d Compare April 9, 2026 02:24
Address review feedback: import _DEFAULT_ASR_PROMPT from
multimodal/processors/qwen3_asr.py instead of duplicating it.
Make the prompt template a public constant since it is imported
by the transcription adapter.
@JustinTong0323
Collaborator

Bug: Streaming output drops spaces at chunk boundaries

Streaming SSE deltas concatenated result vs offline:

  • Offline: He hoped there would be stew for dinner—turnips...
  • Streaming: He hopedthere would be stew fordinner: turnips...

hopedthere, fordinner, andbruised, tobe — spaces missing at chunk boundaries.

Root cause in _generate_chunked_asr_stream:

for j, word in enumerate(delta.split(" ")):
    if not word:
        continue
    content = word if j == 0 else " " + word

When j == 0, no leading space is added. Within a single delta this is fine, but across chunk boundaries the last word of chunk N and first word of chunk N+1 get concatenated without a separator.

Repro:

wget -O /tmp/test.flac "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
python3 -m sglang.launch_server --model-path Qwen/Qwen3-ASR-0.6B --port 38080 --disable-cuda-graph

# Compare:
curl -s localhost:38080/v1/audio/transcriptions -F file=@/tmp/test.flac -F model=Qwen/Qwen3-ASR-0.6B -F response_format=text
curl -s localhost:38080/v1/audio/transcriptions -F file=@/tmp/test.flac -F model=Qwen/Qwen3-ASR-0.6B -F stream=true | grep -oP '"content":"[^"]*"' | sed 's/"content":"//;s/"$//' | tr -d '\n'

The first word of each chunk delta was missing a leading space,
causing words at chunk boundaries to concatenate without separator
(e.g. "hopedthere" instead of "hoped there").

Track a first_word flag across all chunks so only the very first
word of the entire stream omits the leading space.
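The fix can be sketched as follows (simplified from the description above; delta production and SSE framing elided):

```python
# Simplified sketch of the fix: a single first_word flag shared across ALL
# chunk deltas, instead of resetting per-delta via enumerate(). Only the
# very first word of the entire stream omits its leading space.
def emit_words(deltas: list[str]) -> list[str]:
    out, first_word = [], True
    for delta in deltas:
        for word in delta.split(" "):
            if not word:
                continue
            out.append(word if first_word else " " + word)
            first_word = False
    return out


# Words at the chunk boundary now keep their separator:
print("".join(emit_words(["He hoped", "there would"])))  # "He hoped there would"
```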

@JustinTong0323 JustinTong0323 left a comment


Tested streaming ASR on Qwen3-ASR-0.6B. Space-dropping bug at chunk boundaries is fixed. Streaming vs offline diff is now only punctuation (expected). LGTM.

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@mickqian mickqian merged commit 8b991d9 into sgl-project:main Apr 9, 2026
134 of 157 checks passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

documentation Improvements or additions to documentation Multi-modal multi-modal language model run-ci
