
[Feature] Adding Qwen3-asr Model Support#22073

Merged
mickqian merged 12 commits into sgl-project:main from adityavaid:adddingQwen3AsrModelSupport
Apr 7, 2026

Conversation

@adityavaid
Contributor

@adityavaid adityavaid commented Apr 3, 2026

Motivation

Issue: #22025

This PR adds support so users can serve Qwen3-ASR via the existing /v1/audio/transcriptions endpoint.

References

New files

| File | Purpose |
| --- | --- |
| `python/sglang/srt/configs/qwen3_asr.py` | Config classes (`Qwen3ASRConfig`, `Qwen3ASRThinkerConfig`) handling the nested `thinker_config` → `audio_config` / `text_config` layout |
| `python/sglang/srt/models/qwen3_asr.py` | Model class reusing `Qwen3OmniMoeAudioEncoder` + `Qwen3ForCausalLM`, with weight-loading prefix remapping (`thinker.*` → internal names) |
| `python/sglang/srt/multimodal/processors/qwen3_asr.py` | Multimodal processor with `<|audio_start|>`/`<|audio_pad|>`/`<|audio_end|>` token handling |
| `test/manual/test_qwen3_asr.py` | Manual test: launches the server, sends audio, validates output |
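The weight-loading prefix remap noted for `models/qwen3_asr.py` can be sketched as below. This is a minimal illustration assuming a plain `thinker.` prefix strip, not the actual sglang loader code:

```python
# Minimal sketch (an assumption, not the actual sglang implementation) of
# the "thinker.* -> internal names" remapping applied while loading
# weights: checkpoint parameter names carry a "thinker." prefix that the
# internal modules do not use, so it is stripped before lookup.
def remap_weight_name(name: str, prefix: str = "thinker.") -> str:
    """Strip the checkpoint prefix so the name matches internal modules."""
    return name[len(prefix):] if name.startswith(prefix) else name
```

Names without the prefix (e.g. already-remapped or shared weights) pass through unchanged.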

Modified files

| File | What changed |
| --- | --- |
| `configs/__init__.py` | Export `Qwen3ASRConfig` |
| `configs/model_config.py` | Add `Qwen3ASRForConditionalGeneration` to `multimodal_model_archs` and `is_audio_model()`; fix `is_audio_understandable_model` to check the nested `thinker_config.audio_config` |
| `disaggregation/encode_server.py` | Add a `qwen3_asr` branch in `_get_feat_extract_output_lengths` (same formula as `qwen3_omni_moe`) |
| `serving_transcription.py` | Detect the model family at init. For Qwen3-ASR: build a chat-template prompt, strip the `<asr_text>` prefix from output, and skip Whisper-specific timestamp parsing in `verbose_json` |

Accuracy Tests

  • Verified that existing Whisper tests still pass (no behavior change on the Whisper path)
  • Tested with both Qwen/Qwen3-ASR-0.6B and Qwen/Qwen3-ASR-1.7B
  • Added unit tests for qwen3_asr
  • Server launched successfully with:

```shell
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-ASR-0.6B \
  --port 30000 \
  --host 0.0.0.0 \
  --served-model-name qwen3-asr \
  --mem-fraction-static 0.85
```
```shell
curl -s http://localhost:30000/v1/audio/transcriptions \
  -F file=@/tmp/test.flac \
  -F model=qwen3-asr \
  -F response_format=verbose_json | python3 -m json.tool
```

```json
{
    "task": "transcribe",
    "language": null,
    "duration": 10.44,
    "text": "He hoped there would be stew for dinner\u2014turnips and carrots and bruised potatoes and fat mutton pieces\u2014to be ladled out in thick peppered flour-fatted sauce.",
    "segments": [],
    "usage": {
        "type": "duration",
        "seconds": 11
    }
}
```
```shell
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "<|im_start|>user\n<|audio_start|><|audio_pad|><|audio_end|><|im_end|>\n<|im_start|>assistant\n",
    "audio_data": "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac",
    "modalities": ["audio"],
    "sampling_params": {"temperature": 0, "max_new_tokens": 256}
  }'
```

```json
{"text":"language English<asr_text>He hoped there would be stew for dinner—turnips and carrots and bruised potatoes and fat mutton pieces—to be ladled out in thick peppered flour-fatted sauce.","output_ids":[11528,6364,151704,1519,25189,1052,1035,387,60343,369,13856,2293,412,3077,323,61417,323,42000,4056,34167,323,8664,296,959,9666,49517,387,57625,832,700,304,12045,24353,291,19828,2220,12127,19187,13,151645],"meta_info":{"id":"7d1491d1ad264b039ce40452cb269e28","finish_reason":{"type":"stop","matched":151645},"prompt_tokens":146,"weight_version":"default","total_retractions":0,"completion_tokens":40,"cached_tokens":0,"cached_tokens_details":null,"dp_rank":null,"e2e_latency":1.1166456790015218,"response_sent_to_client_ts":1775261852.3463674}}
```
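The `/generate` prompt above follows a fixed chat-template shape. As a sketch, the single `<|audio_pad|>` is a placeholder that the multimodal processor later expands to match the audio feature length:

```python
# Sketch of how the /generate prompt string above is assembled: one audio
# placeholder span inside a standard chat turn. The expansion of
# <|audio_pad|> to the real feature length happens in the multimodal
# processor, not here.
def build_asr_prompt() -> str:
    audio_span = "<|audio_start|><|audio_pad|><|audio_end|>"
    return (
        "<|im_start|>user\n"
        f"{audio_span}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```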

Speed Tests and Profiling

Qwen3-ASR-1.7B Benchmark

```
------------------------------
Results for Qwen/Qwen3-ASR-1.7B:
Total Requests: 50
WER: 25.2366
Average Latency: 0.4370s
Median Latency: 0.3654s
95th Latency: 0.8698s
Throughput: 5.17 req/s
Token Throughput: 117.69 tok/s
Total Test Time: 9.6776s
------------------------------

==================== Sample Predictions ====================
Sample 1:
  REF: um, on the use of taxonomy, i, you know, i think it's, it's early days for us to, to make any, um, clear indications to the market about, uh, the proportion that would fall under that, um, requirement.
  PRED: on the eu taxonomy, i think it's early days for us to make any clear indications to the market about the proportion that would fall under that requirement. so.
----------------------------------------
Sample 2:
  REF: so within fiscal year 2021, say 120, a hundred depending on what the micro will do, and next year, uh, it's not necessarily payable in q1, is we'll look at what the cash flows for 2022 look like.
  PRED: so within fiscal year 2021, say 120, depending on what the macro will do, and next year, it's not necessarily payable in q1, is we'll look at what the cash flows for 2022 look like and.
----------------------------------------
Sample 3:
  REF: we talked about 4.7 gigawatts.
  PRED: we talked about 4.7 gigawatts.
----------------------------------------
Sample 4:
  REF: and, you know, depending on that working capital build, we'll, we'll see what that yields.
  PRED: and depending on that working capital build, we'll see what that yields.
----------------------------------------
Sample 5:
  REF: so on, on sinopec, what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30, 70%.
  PRED: so, on sanopek, what we have agreed with sanopek way back then is that free cash flows, after paying all capexes, are distributed out 30-70%.
----------------------------------------
============================================================
```

Qwen3-ASR-0.6B Benchmark

```
------------------------------
Results for Qwen/Qwen3-ASR-0.6B:
Total Requests: 50
WER: 23.6593
Average Latency: 0.3149s
Median Latency: 0.2252s
95th Latency: 1.1177s
Throughput: 6.99 req/s
Token Throughput: 152.87 tok/s
Total Test Time: 7.1565s
------------------------------

==================== Sample Predictions ====================
Sample 1:
  REF: um, on the use of taxonomy, i, you know, i think it's, it's early days for us to, to make any, um, clear indications to the market about, uh, the proportion that would fall under that, um, requirement.
  PRED: on the eu taxonomy, i think it's early days for us to make any clear indications to the market about the proportion that would fall under that requirement.
----------------------------------------
Sample 2:
  REF: so within fiscal year 2021, say 120, a hundred depending on what the micro will do, and next year, uh, it's not necessarily payable in q1, is we'll look at what the cash flows for 2022 look like.
  PRED: so within fiscal year 2021, say 120, depending on what the micro will do, and next year, it's not necessarily payable in q1. is we'll look at what the cash flows for 2022 look like and.
----------------------------------------
Sample 3:
  REF: we talked about 4.7 gigawatts.
  PRED: we talked about 4.7 gigawatts.
----------------------------------------
Sample 4:
  REF: and, you know, depending on that working capital build, we'll, we'll see what that yields.
  PRED: and depending on that working capital build, we'll see what that yields.
----------------------------------------
Sample 5:
  REF: so on, on sinopec, what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30, 70%.
  PRED: so on on sinopac, what we have agreed with sinopac way back then is that free cash flows after paying all capexes are distributed out 30-70.
----------------------------------------
============================================================
```
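For reference, the WER figures in these benchmarks are word-level edit distance divided by reference length (the benchmark scripts use `jiwer` per the install commands below, but the metric itself fits in a few lines of plain Python; multiply by 100 for the percentage form reported above):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)
```

Note that different normalizations (punctuation, casing, filler words such as "um"/"uh") can shift WER by several points, which is why the REF/PRED pairs above disagree mostly on fillers.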

Benchmark Results: SGLang vs. Transformers

Audio samples used for testing:

```
AUDIO_EN = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
AUDIO_ZH = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav"
```

Model: Qwen/Qwen3-ASR-0.6B

| Metric | SGLang | Transformers (HF) |
| --- | --- | --- |
| EN transcription | "Oh yeah, yeah. He wasn't even that big when I started listening to him. But and …" | "Hmm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and ..." |
| ZH transcription | "甚至出现交易几乎停滞的情况。" | "甚至出现交易几乎停滞的情况。" |
| EN avg latency (5 runs) | 0.246 s | 1.113 s |
| ZH avg latency (5 runs) | 0.088 s | 0.258 s |

Model: Qwen/Qwen3-ASR-1.7B

| Metric | SGLang | Transformers (HF) |
| --- | --- | --- |
| EN transcription | "Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, …" | "Hmm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and..." |
| ZH transcription | "甚至出现交易几乎停滞的情况。" | "甚至出现交易几乎停滞的情况。" |
| EN avg latency (5 runs) | 0.472 s | 1.158 s |
| ZH avg latency (5 runs) | 0.123 s | 0.272 s |

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include `/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@adityavaid adityavaid changed the title [Feat] Adding Qwen3_asr Model Support [Feat] Adding Qwen3-asr Model Support Apr 3, 2026
@adityavaid adityavaid changed the title [Feat] Adding Qwen3-asr Model Support [Feature] Adding Qwen3-asr Model Support Apr 3, 2026
@AgainstEntropy
Collaborator

Hi @adityavaid, thanks for your PR.
I tested with

```shell
python -m sglang.launch_server --model Qwen/Qwen3-ASR-1.7B --trust-remote-code --port 30010 --host 0.0.0.0 --served-model-name qwen3-asr
```

and it raised the following error:

```
AttributeError: Qwen2Tokenizer has no attribute tokenizer. Did you mean: '_tokenizer'?
```

Would you mind sharing the manual test scripts and/or commands you were using? Thanks!

@adityavaid
Contributor Author

@mickqian addressed all comments

@mickqian
Collaborator

mickqian commented Apr 5, 2026

how about numbers of transformers?

@adityavaid
Contributor Author

adityavaid commented Apr 5, 2026

> how about numbers of transformers?

Added that in another section; I used a separate script to run the Transformers test in my setup. Should I add the script here for review?

@mickqian
Collaborator

mickqian commented Apr 5, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 5, 2026
@adityavaid
Contributor Author

/rerun-failed-ci

@mickqian
Collaborator

mickqian commented Apr 6, 2026

may I have accuracy comparison result?

@AgainstEntropy
Collaborator

> may I have accuracy comparison result?

Dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered (511 samples, validation split)
GPU: 1× H200

Qwen3-ASR-0.6B

| Metric | Transformers (bs=4) | SGLang (concurrency=4) |
| --- | --- | --- |
| WER | 23.56 | 23.58 |
| Avg Latency | 0.3276 s | 0.4352 s |
| Median Latency | 0.2837 s | 0.3483 s |
| P95 Latency | 0.5034 s | 0.6952 s |
| Throughput | 3.05 req/s | 8.50 req/s |
| Token Throughput | 59.17 tok/s | 162.84 tok/s |
| Total Time | 167.50 s | 60.15 s |

Qwen3-ASR-1.7B

| Metric | Transformers (bs=4) | SGLang (concurrency=4) |
| --- | --- | --- |
| WER | 23.92 | 24.48 |
| Avg Latency | 0.3389 s | 0.4348 s |
| Median Latency | 0.2946 s | 0.3909 s |
| P95 Latency | 0.5268 s | 0.6906 s |
| Throughput | 2.95 req/s | 8.45 req/s |
| Token Throughput | 58.01 tok/s | 165.80 tok/s |
| Total Time | 173.26 s | 60.51 s |

Note: Latency is measured differently in the two setups: Transformers reports batch_time / batch_size (amortized), while SGLang reports per-request end-to-end time (including network and server queuing). Throughput and Total Time are more suitable for direct comparison.
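To make the note above concrete, here is a tiny illustration (with hypothetical numbers) of the two latency definitions and why throughput is the safer comparison:

```python
# Amortized batch latency (as reported for Transformers): one wall-clock
# batch time divided by the batch size. Per-request end-to-end latency
# (as reported for SGLang) additionally includes network and queuing
# time, so the two columns are not directly comparable.
def amortized_latency(batch_time_s: float, batch_size: int) -> float:
    return batch_time_s / batch_size

def throughput_req_per_s(total_requests: int, total_time_s: float) -> float:
    return total_requests / total_time_s
```

For example, a 4-sample batch taking 1.6 s amortizes to 0.4 s per request, even though each request actually waited 1.6 s of wall clock; throughput over the whole run has no such ambiguity.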

Commands

Transformers

```shell
git clone https://github.com/QwenLM/Qwen3-ASR
cd Qwen3-ASR/
pip install -e .
pip install datasets evaluate jiwer librosa soundfile
```

bench_transformers.py

```shell
# 0.6B
python bench_transformers.py --model Qwen/Qwen3-ASR-0.6B --batch-size 4 --output results-0.6B.json

# 1.7B
python bench_transformers.py --model Qwen/Qwen3-ASR-1.7B --batch-size 4 --output results-1.7B.json
```

SGLang

```shell
pip install datasets evaluate jiwer librosa soundfile openai

# 0.6B
python -m sglang.launch_server --model Qwen/Qwen3-ASR-0.6B --port 30000 --host 0.0.0.0 --mem-fraction-static 0.85
python benchmark/asr/bench_sglang.py --base-url http://localhost:30000 --model Qwen/Qwen3-ASR-0.6B --api-type transcription --output results-0.6B.json

# 1.7B
python -m sglang.launch_server --model Qwen/Qwen3-ASR-1.7B --port 30000 --host 0.0.0.0 --mem-fraction-static 0.85
python benchmark/asr/bench_sglang.py --base-url http://localhost:30000 --model Qwen/Qwen3-ASR-1.7B --api-type transcription --output results-1.7B.json
```

@adityavaid
Contributor Author

@mickqian
I added the accuracy benchmark WER in the description.
Could you approve again? I have addressed all the other comments.

SammLSH added a commit to SammLSH/sglang that referenced this pull request Apr 6, 2026
Implement streaming transcription with chunk-based processing and
prefix rollback, based on the Qwen3-ASR paper (arXiv:2601.21337).

New files:
- streaming_asr.py: StreamingASRState, split_audio_chunks, build_streaming_prompt

Modified files:
- serving_transcription.py: route streaming requests through chunked
  ASR pipeline for Qwen3-ASR model family
- hf_transformers_utils.py: add Qwen3ASRConfig to _CONFIG_REGISTRY

Depends on sgl-project#22073 for Qwen3-ASR model support.
Ref: sgl-project#22025 (streaming input), vllm-project/vllm#35908 (related RFC)
@mickqian mickqian merged commit f6e8567 into sgl-project:main Apr 7, 2026
267 of 325 checks passed
SammLSH added a commit to SammLSH/sglang that referenced this pull request Apr 7, 2026
Fridge003 pushed a commit that referenced this pull request Apr 7, 2026
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>