
[Feature] Add chunk-based streaming ASR for Qwen3-ASR#22089

Merged
mickqian merged 5 commits into sgl-project:main from SammLSH:feat/qwen3-asr-streaming
Apr 9, 2026
Conversation

Contributor

@SammLSH SammLSH commented Apr 4, 2026

Motivation

Issue: #22025 (streaming input design and implementation)

This PR adds chunk-based streaming transcription support for Qwen3-ASR via the existing /v1/audio/transcriptions endpoint with stream=true. Audio is processed in 2-second chunks with prefix rollback, emitting partial transcripts via SSE as each chunk completes — reducing time-to-first-text compared to waiting for the full audio to be processed.

Built on top of #22073 (Qwen3-ASR model support by @adityavaid). This PR is branched from #22073 and should be merged after it. My changes are streaming_asr.py (new), streaming additions in serving_transcription.py, and a config fix in hf_transformers_utils.py.

Approach

Based on the Qwen3-ASR streaming algorithm (arXiv:2601.21337):

  • Audio is split into 2-second accumulated chunks (each chunk contains all audio from 0 to current position)
  • Each chunk is processed as an independent transcription request
  • Previous transcript is fed as decoder prefix with rollback (last 5 words dropped) to reduce boundary jitter
  • First 2 chunks are "cold start" — no prefix is injected (model doesn't have enough context yet)
  • Stable text deltas are emitted via SSE word-by-word for smooth output
  • Client disconnection is detected between chunks to avoid wasted computation
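The prefix-rollback rule in the steps above can be sketched in a few lines. This is illustrative only — the function name and constants mirror the description, not the actual streaming_asr.py API:

```python
# Illustrative sketch of the prefix-rollback scheme described above;
# names and defaults are assumptions, not the real streaming_asr.py code.
COLD_START_CHUNKS = 2  # first chunks get no prefix (not enough context yet)
ROLLBACK_WORDS = 5     # drop the last 5 words to reduce boundary jitter


def prefix_for_chunk(chunk_index: int, previous_transcript: str) -> str:
    """Return the decoder prefix to inject when transcribing this chunk."""
    if chunk_index < COLD_START_CHUNKS:
        return ""  # cold start: inject nothing
    words = previous_transcript.split()
    return " ".join(words[:-ROLLBACK_WORDS])  # roll back the unstable tail


print(prefix_for_chunk(0, "he hoped there would be stew"))  # "" (cold start)
print(prefix_for_chunk(3, "he hoped there would be stew for dinner"))  # "he hoped there"
```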

This is a server-side chunked streaming approach. vLLM is exploring a complementary client-side approach (vllm-project/vllm#35908) where callers control chunking via a prefix parameter. Both use the same underlying algorithm.

New files

File Purpose
python/sglang/srt/entrypoints/openai/streaming_asr.py StreamingASRState (prefix rollback state management with configurable chunk size, rollback window, cold start period), split_audio_chunks (split audio into accumulated chunks), build_streaming_prompt (construct chat-template prompt with optional prefix text)
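A minimal sketch of the accumulated-chunk splitting (hypothetical signature; the real split_audio_chunks operates on decoded audio arrays and may differ):

```python
# Sketch of accumulated chunking: chunk i contains ALL samples from 0 to
# the i-th 2-second boundary, with the final chunk covering the full audio.
# Signature is an assumption, not the actual streaming_asr.py API.
def split_audio_chunks(samples: list[float], sample_rate: int,
                       chunk_size_sec: float = 2.0) -> list[list[float]]:
    step = int(sample_rate * chunk_size_sec)
    # Boundaries at every full chunk, plus the end of the audio.
    ends = list(range(step, len(samples), step)) + [len(samples)]
    return [samples[:end] for end in ends]


chunks = split_audio_chunks([0.0] * (16000 * 5), 16000)  # 5 s at 16 kHz
print([len(c) / 16000 for c in chunks])  # [2.0, 4.0, 5.0]
```

Note that a 1.5 s input yields a single chunk and an exact 2.0 s input also yields a single chunk, matching the edge cases in the test table below.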

Modified files

File What changed
serving_transcription.py Add _generate_chunked_asr_stream method. Route stream=true requests through chunked ASR pipeline when model family is qwen3_asr. Detect client disconnection between chunks via raw_request.is_disconnected().
hf_transformers_utils.py Add Qwen3ASRConfig to _CONFIG_REGISTRY import and registry list (fix for #22073 — without this, AutoConfig.from_pretrained fails with KeyError: 'qwen3_asr').

This design is consistent with vLLM's findings (vllm-project/vllm#35767): their Qwen3-ASR realtime endpoint also produces degraded output with stateless segments and no cross-segment context. Quality improves with a larger chunk_size_sec (e.g. 2s → 4s, fewer boundaries) at the cost of higher first-text latency.

Tested Cases

Test audio: https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac (11s English speech)

Derived clips generated via:

import soundfile as sf
data, sr = sf.read('/tmp/test_speech.flac')
sf.write('/tmp/short.flac', data[:int(sr*1.5)], sr)   # 1.5s, 1 chunk
sf.write('/tmp/exact.flac', data[:int(sr*2.0)], sr)   # 2.0s, exact boundary
sf.write('/tmp/mid.flac',   data[:int(sr*5.0)], sr)   # 5s, 3 chunks

Chinese test audio (gTTS)

from gtts import gTTS
import librosa, soundfile as sf
gTTS('今天天气真好,我想去公园散步,顺便买一些新鲜的水果回家做饭', lang='zh').save('/tmp/test_zh.mp3')
data, sr = librosa.load('/tmp/test_zh.mp3', sr=16000)
sf.write('/tmp/test_zh.flac', data, sr)

Case Input Result
Offline test_speech.flac (11s) ✅ Full accurate transcription
Streaming, 11s test_speech.flac (6 chunks) ✅ Word-by-word, near-identical to offline (only — → : punctuation diff)
Streaming, 1.5s short.flac (1 chunk) ✅ finalize() single-chunk path works
Streaming, 2.0s exact.flac (chunk boundary) ✅ Edge case handled
Streaming, 5s mid.flac (3 chunks) ✅ Multi-chunk path works
Empty audio touch empty.flac ✅ 400: audio_data is empty
Corrupted audio echo "garbage" > garbage.flac ✅ 400: Format not recognised
response_format=text test_speech.flac ✅ Plain text response
response_format=verbose_json test_speech.flac ✅ JSON with language/duration
Streaming, Chinese audio test_zh.flac (8s, gTTS) ✅ Output correct, but full text returned in single delta (CJK rollback ineffective, known limitation — see StreamingASRState docstring)

Architecture: Why Each Chunk Is an Independent Request

Each chunk is processed as a separate SGLang inference request. This has two consequences:

  1. Repeated encoding: Each chunk re-encodes all accumulated audio from the start. For 10s audio with 2s chunks, total encoder work is 2+4+6+8+10 = 30s instead of 10s (~3x overhead). Acceptable for small models (0.6B encoder < 100ms per chunk) but becomes a bottleneck for long audio.

  2. No shared state across chunks: No KV cache reuse between chunks, no encoder output caching. SGLang's scheduler treats each request independently — no concept of a "streaming session" that persists across requests.
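The cumulative-encoding overhead from point 1 can be checked with pure arithmetic (no SGLang dependency; the function name is illustrative):

```python
# Total seconds of audio pushed through the encoder across all accumulated
# chunks: each chunk re-encodes everything from t=0 up to its boundary.
def encoder_work_sec(audio_sec: float, chunk_sec: float = 2.0) -> float:
    n_full = int(audio_sec // chunk_sec)
    ends = [chunk_sec * (i + 1) for i in range(n_full)]
    if not ends or ends[-1] < audio_sec:
        ends.append(audio_sec)  # final partial chunk covers the full audio
    return sum(ends)


print(encoder_work_sec(10.0))       # 2+4+6+8+10 = 30.0
print(encoder_work_sec(10.0) / 10)  # ~3x overhead vs. one-shot encoding
```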

This is the MVP architecture choice: operate purely at the serving layer, no scheduler or model code changes. Keeps the change minimal and reviewable.

Known Limitations (planned follow-ups)

  • Encoder window caching: Cache completed 100-frame encoder windows across chunks to avoid re-encoding (requires model-layer changes to get_audio_feature)
  • Token-level prefix: Current prefix is text-level (string concatenation); token-level prefix injection would be more precise
  • Cross-chunk KV cache reuse: Requires scheduler to support persistent streaming sessions
  • True real-time input: WebSocket endpoint for live audio streaming instead of chunked processing of complete audio
  • Token-level output within chunks: Use stream=True per chunk for smoother output (current implementation emits word-by-word after each chunk completes)
  • prefix API parameter: Add optional prefix parameter to /v1/audio/transcriptions for client-side streaming control (as discussed in [RFC]: Model-specific realtime streaming abstraction vllm-project/vllm#35908)

Checklist

  • Format your code with pre-commit.
  • Run and add unit tests.
  • Update the documentation.
  • Follow the SGLang code style guidance.

@JustinTong0323 JustinTong0323 self-assigned this Apr 4, 2026
@JustinTong0323
Collaborator

serving_transcription.py is somewhat hardcoded for Whisper; we could refactor it a bit.


@JustinTong0323 JustinTong0323 left a comment


PR Review Summary

Thanks for the clean implementation! The server-side chunked streaming approach is well-designed and minimal. Below are some issues found during review, organized by severity.

Critical (3): StopAsyncIteration escape, prompt template duplication, missing input validation
Important (6): prefix text source, language=None, CJK word-splitting, silent None returns, formula duplication, rsplit edge case
Suggestions (4): state invariants, dangerous default, silent fallback, docstring typo

Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment thread python/sglang/srt/configs/qwen3_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/streaming_asr.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py
Comment thread python/sglang/srt/configs/model_config.py
Comment thread python/sglang/srt/entrypoints/openai/serving_transcription.py Outdated
Comment on lines +16 to +18
chunk_size_sec: float = 2.0 # Paper: "2-second chunk size"
unfixed_chunk_num: int = 2 # qwen3_asr.py default: 2 paper: 4
unfixed_token_num: int = 5 # Paper: "5-token fallback"
Collaborator


Are these numbers pre-selected for qwen3-asr only? Could we expose it for user to change?

Contributor Author

@SammLSH SammLSH Apr 4, 2026


Right, these are Qwen3-ASR default settings from the paper. The dataclass already supports overrides via its constructor, so adding different defaults for new ASR models should be straightforward. Exposing them as user-facing API parameters can be a follow-up PR.
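For instance, a dataclass like this supports per-field overrides out of the box (the class name here is illustrative; the field names mirror the snippet above):

```python
from dataclasses import dataclass


@dataclass
class ChunkedStreamingConfig:  # hypothetical name for the config dataclass
    chunk_size_sec: float = 2.0   # paper: 2-second chunk size
    unfixed_chunk_num: int = 2    # cold-start chunks without prefix
    unfixed_token_num: int = 5    # rollback window ("5-token fallback")


cfg = ChunkedStreamingConfig()                      # Qwen3-ASR defaults
cfg_4s = ChunkedStreamingConfig(chunk_size_sec=4.0,  # a new model could
                                unfixed_chunk_num=4)  # override per-field
print(cfg_4s.chunk_size_sec)  # 4.0
```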

@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch from 42c93e5 to 1de73b9 Compare April 4, 2026 18:35
Contributor Author

SammLSH commented Apr 4, 2026

@JustinTong0323 Thanks for the review and detailed feedback. All review feedback has been addressed: the Critical and most of the Important issues are fixed. Some items (hardcoded `or hasattr`, PretrainedConfig fallback, rsplit edge case, silent None returns) come from #22073 and are left for @adityavaid for now, but I'm happy to fix them in this PR if you prefer. Refactoring (strategy pattern, CJK rollback, API parameter exposure) is planned as follow-ups. Ready for re-review.

@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch from cea7314 to cac24fd Compare April 4, 2026 20:02
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from 168d7da to a27564e Compare April 6, 2026 21:46
@github-actions github-actions Bot added documentation Improvements or additions to documentation Multi-modal multi-modal language model labels Apr 6, 2026
Collaborator

@ShangmingCai ShangmingCai Apr 7, 2026


LGTM for the modification of encode_server

@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from e4e5388 to f7a2504 Compare April 7, 2026 07:45
@SammLSH SammLSH requested a review from JustinTong0323 April 7, 2026 07:47
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 3 times, most recently from bde404a to 5c45046 Compare April 7, 2026 08:18
@AgainstEntropy
Collaborator

Hi @SammLSH, can you rebase onto main and adjust accordingly, since #22181 was merged?
Also, can you add some documentation for ASR streaming input?

@AgainstEntropy AgainstEntropy self-requested a review April 8, 2026 20:08
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from 101484f to 969d984 Compare April 8, 2026 22:15
Add chunked streaming transcription for ASR models (e.g. Qwen3-ASR).
Audio is processed in configurable chunks with prefix rollback to
reduce boundary jitter, emitting partial transcripts via SSE.

- Add StreamingASRState for prefix rollback state management
- Add split_audio_chunks utility for cumulative audio chunking
- Extend TranscriptionAdapter with supports_chunked_streaming,
  prompt_template, and chunked_streaming_config
- Route chunked streaming via adapter pattern (no model-specific
  logic in serving layer)
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch from 969d984 to 918e08d Compare April 8, 2026 23:19
Contributor Author

SammLSH commented Apr 8, 2026

Hi @AgainstEntropy, done. I'll add ASR streaming documentation shortly and create a separate follow-up PR tonight for streaming input via WebSocket, with a design document, sample inputs, and tests. While this PR focuses on server-side chunked streaming of complete audio uploads, the new PR will build on top of this chunked streaming implementation.

Collaborator

mickqian commented Apr 9, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 9, 2026

@mickqian mickqian left a comment


Consider adding a bench serving test as a follow-up. Cheers.

Comment thread python/sglang/srt/entrypoints/openai/transcription_adapters/qwen3_asr.py Outdated
@SammLSH SammLSH force-pushed the feat/qwen3-asr-streaming branch 2 times, most recently from 2a7a15e to 918e08d Compare April 9, 2026 02:24
Address review feedback: import _DEFAULT_ASR_PROMPT from
multimodal/processors/qwen3_asr.py instead of duplicating it.
Make the prompt template a public constant since it is imported
by the transcription adapter.
@JustinTong0323
Collaborator

Bug: Streaming output drops spaces at chunk boundaries

Streaming SSE deltas concatenated result vs offline:

  • Offline: He hoped there would be stew for dinner—turnips...
  • Streaming: He hopedthere would be stew fordinner: turnips...

hopedthere, fordinner, andbruised, tobe — spaces missing at chunk boundaries.

Root cause in _generate_chunked_asr_stream:

for j, word in enumerate(delta.split(" ")):
    if not word:
        continue
    content = word if j == 0 else " " + word

When j == 0, no leading space is added. Within a single delta this is fine, but across chunk boundaries the last word of chunk N and first word of chunk N+1 get concatenated without a separator.

Repro:

wget -O /tmp/test.flac "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
python3 -m sglang.launch_server --model-path Qwen/Qwen3-ASR-0.6B --port 38080 --disable-cuda-graph

# Compare:
curl -s localhost:38080/v1/audio/transcriptions -F file=@/tmp/test.flac -F model=Qwen/Qwen3-ASR-0.6B -F response_format=text
curl -s localhost:38080/v1/audio/transcriptions -F file=@/tmp/test.flac -F model=Qwen/Qwen3-ASR-0.6B -F stream=true | grep -oP '"content":"[^"]*"' | sed 's/"content":"//;s/"$//' | tr -d '\n'

The first word of each chunk delta was missing a leading space,
causing words at chunk boundaries to concatenate without separator
(e.g. "hopedthere" instead of "hoped there").

Track a first_word flag across all chunks so only the very first
word of the entire stream omits the leading space.
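The fix can be sketched as follows (simplified from the description above; delta production and SSE framing elided):

```python
# Simplified sketch of the fix: a single first_word flag shared across ALL
# chunk deltas, instead of resetting per-delta via enumerate(). Only the
# very first word of the entire stream omits its leading space.
def emit_words(deltas: list[str]) -> list[str]:
    out, first_word = [], True
    for delta in deltas:
        for word in delta.split(" "):
            if not word:
                continue
            out.append(word if first_word else " " + word)
            first_word = False
    return out


# Words at the chunk boundary now keep their separator:
print("".join(emit_words(["He hoped", "there would"])))  # "He hoped there would"
```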

@JustinTong0323 JustinTong0323 left a comment


Tested streaming ASR on Qwen3-ASR-0.6B. Space-dropping bug at chunk boundaries is fixed. Streaming vs offline diff is now only punctuation (expected). LGTM.

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@mickqian mickqian merged commit 8b991d9 into sgl-project:main Apr 9, 2026
134 of 157 checks passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

documentation Improvements or additions to documentation Multi-modal multi-modal language model run-ci
