[Feature] WebSocket streaming audio input for ASR #22848
SammLSH wants to merge 10 commits into sgl-project:main
Conversation
Force-pushed from 236cea0 to 2f6c8b5
- Wrap the session lifecycle in a `RealtimeConnection` class; remove the free-standing handler functions and the `RealtimeTranscriptionSession` dataclass. State and methods now live on a single per-connection instance, mirroring vLLM's `RealtimeConnection` structure.
- Replace ad-hoc dict messages with Pydantic event models (`SessionStartEvent`, `SessionStartedEvent`, `TranscriptDeltaEvent`, `TranscriptFinalEvent`, `ErrorEvent`) for schema-validated send/receive. Pydantic now catches `invalid_payload` cases (wrong types, missing fields) automatically.
- Drop the `RealtimeMessageType` and `RealtimeErrorCode` enums in favor of bare strings, matching sglang's `serving_base.py` and vLLM's `realtime/connection.py` conventions.
- The module docstring now shows concrete wire examples and explicitly notes the protocol is sglang-specific (it does not align with OpenAI Realtime / vLLM `/v1/realtime`).
- Add a note in `_pcm_to_wav` about the cumulative re-encoding cost; deferred to M2 (RadixCache prefix caching).
- Update a test docstring to drop a reference to a renamed internal helper.
Reject non-positive values at startup so misconfiguration fails fast instead of producing immediately-closing sessions (`value == 0`) or disabling the backpressure cap entirely (`value < 0`, so the comparison never trips). Validation runs before the dummy-model short-circuit, so it applies regardless of which model is being served.
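A minimal sketch of that startup check (the helper name `_handle_asr_validation` appears in the files-touched list below; the exact body is an assumption):

```python
def _handle_asr_validation(asr_max_buffer_seconds: int) -> None:
    # 0 would close every session immediately; negative values would make the
    # "buffered audio > cap" comparison never trip, silently disabling the cap.
    if asr_max_buffer_seconds <= 0:
        raise ValueError(
            f"--asr-max-buffer-seconds must be > 0, got {asr_max_buffer_seconds}"
        )
```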
```python
TranscriptionRequest(language=event.language)
if event.language
else TranscriptionRequest()
```

Suggested change:

```python
TranscriptionRequest(language=event.language)
```
```python
adapter = self.serving._adapter
if not adapter.supports_chunked_streaming:
```
`_adapter` is a private member of `OpenAIServingTranscription`, and we are now accessing it across classes.
I'm considering two alternatives:

1. Expose a public `adapter` property, or create delegation to re-export the needed properties and methods of `_adapter`.
2. Have `handle_realtime_transcription` accept `tokenizer_manager` and `_adapter` instead of `OpenAIServingTranscription` (this also gets rid of the circular imports and `TYPE_CHECKING`).
Option 2 sounds good to me. `RealtimeConnection` only uses `tokenizer_manager` and `adapter` from `serving`, and passing them in also lets us drop the `TYPE_CHECKING` import.
```python
tokenizer_manager,
adapter,
```
please add type hints
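A sketch of what the requested hints could look like on the option-2 constructor (the import paths and the `TranscriptionAdapter` module location are assumptions, not the PR's actual layout):

```python
# Sketch only: import paths are assumptions.
from sglang.srt.managers.tokenizer_manager import TokenizerManager
from sglang.srt.entrypoints.openai.serving_transcription import TranscriptionAdapter


class RealtimeConnection:
    def __init__(
        self,
        tokenizer_manager: TokenizerManager,
        adapter: TranscriptionAdapter,
    ) -> None:
        self.tokenizer_manager = tokenizer_manager
        self.adapter = adapter
```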
```python
import io

import numpy as np
import soundfile as sf

_SAMPLE_RATE = 16000  # protocol-fixed: PCM16 mono 16 kHz


def _pcm_to_wav(pcm_buffer: bytes) -> bytes:
    """Wrap raw PCM16 mono 16 kHz bytes in a WAV container so
    ``soundfile.read`` (called by the multimodal processor) can decode it.

    Note: callers re-encode the entire cumulative buffer per chunk (M1
    constraint). Cost is bounded by ``--asr-max-buffer-seconds``; M2 plans
    RadixCache prefix caching to remove this overhead.
    """
    if not pcm_buffer:
        raise ValueError("pcm_buffer is empty")
    samples = np.frombuffer(pcm_buffer, dtype=np.int16)
    buf = io.BytesIO()
    sf.write(buf, samples, _SAMPLE_RATE, format="WAV")
    return buf.getvalue()
```
Here PCM is converted to WAV, but later the multimodal processor will need to decode the WAV back to a float32 ndarray. Potentially we can make `load_audio` accept a pre-decoded ndarray to skip the round trip.
(Maybe a follow-up PR.)
Will do the ndarray path as a follow-up.
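For reference, a minimal sketch of that follow-up direction: decode the PCM buffer straight to the float32 ndarray the processor ultimately wants, skipping the WAV container (the helper name `_pcm_to_float32` is hypothetical, and `load_audio` accepting a pre-decoded ndarray is proposed, not current, behavior):

```python
import numpy as np


def _pcm_to_float32(pcm_buffer: bytes) -> np.ndarray:
    # int16 PCM spans [-32768, 32767]; dividing by 32768.0 maps it into
    # [-1.0, 1.0), the float32 range soundfile.read would have produced.
    samples = np.frombuffer(pcm_buffer, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0
```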
| "sampling parameters if available. Default is 'model'.", | ||
| ) | ||
| parser.add_argument( | ||
| "--asr-max-buffer-seconds", |
There was a problem hiding this comment.
This is a cap per session; globally we also need an `--asr-max-concurrent-sessions`.
Will add it in the next commit.
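A sketch of one way such a global cap could sit at the connection-accept site (the semaphore approach and the close code are illustrative; the PR later lands this as `--asr-max-concurrent-sessions` with a pre-connection `too_many_sessions` error):

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 32  # illustrative; in practice this comes from the CLI flag
_session_slots = asyncio.Semaphore(MAX_CONCURRENT_SESSIONS)


async def run_session(ws) -> None:
    """Placeholder for the real per-connection receive/inference loop."""


async def handle_connection(ws) -> None:
    if _session_slots.locked():
        # Every slot is taken: reject before doing any per-session work.
        await ws.close(code=1013)  # 1013 = "try again later"
        return
    async with _session_slots:
        await run_session(ws)
```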
```
This protocol is sglang-specific; it does not align with OpenAI's Realtime
API spec (vLLM's ``/v1/realtime``). Clients written against OpenAI Realtime
will not work as-is. See ``test/manual/models/test_qwen3_asr.py`` for a
reference client and the Pydantic event models below for full schema.
```
I think it would be nice to have an OpenAI-compatible protocol from the start, because it will influence some future designs.
cc @mickqian
Agreed, I can update it in the next commit.
Move `/v1/audio/transcriptions/stream` to `/v1/realtime` and switch from the M1 `session.start`/binary-PCM protocol to OpenAI's Realtime transcription wire format. The shared inference driver is untouched, so HTTP SSE and WS still produce byte-identical transcripts; this is purely a transport rewrite.

sglang deviations from the spec live in the module docstring: `sample_rate` is a sglang extension accepting 16/24/48 kHz with internal resample (OpenAI fixes `audio/pcm` at 24 kHz), `turn_detection` and `noise_reduction` must be null (no server-side VAD), `include[]` is dropped, `model` is echo-only.

Addresses sgl-project#22848 review threads #1 (decouple from `OpenAIServingTranscription`), #4 (type hints), and #5 (`--asr-max-concurrent-sessions`, default 32). #2 (skip the PCM round trip) is deferred since it changes `process_asr_chunk`'s input contract.
Force-pushed from 0725add to 3167135
Move `/v1/audio/transcriptions/stream` to `/v1/realtime` and switch from the M1 `session.start`/binary-PCM protocol to OpenAI's Realtime transcription wire format. Add `--asr-max-concurrent-sessions` (default 32) as the global session cap. sglang deviations from the spec are documented in the module docstring: `sample_rate` accepts 16/24/48 kHz with internal resample to 16 kHz, `turn_detection` and `noise_reduction` must be null, `include[]` is dropped, `model` is echo-only.
Force-pushed from 3167135 to 296fff3
The 5 PR-added HTTP fixture tests now use `_assert_close_to_ref(max_wer=0.05)` instead of `len > 0`. New `test_websocket_rejects_unsupported_sample_rate` covers the 22050 Hz reject path. The Spanish WS test keeps native 48 kHz so the server-side resample runs (it was previously bypassed by client pre-resample). Wraps the WS receive task in `asyncio.wait_for(timeout=60)` so a stuck inference fails fast with the deltas seen so far. Trims docstrings, section banners, and `assertIn` scaffolding. 20/20 pass, ~53 s.
`isinstance` guards now reject non-dict `session.audio`, `session.audio.input`, and `session.audio.input.transcription` with `invalid_value` instead of crashing into the unrecoverable handler. Tests added for each. `buffer_overflow` now reports `error_type=server_error` (was `rate_limit_exceeded`, which would have misled clients into backoff when the right action is reconnect). Extracted `_format_error_event` so `_send_error` and the pre-connection `too_many_sessions` path share one error-event schema. Also makes `_send_error`'s `param`/`error_type` keyword-only, drops a stale "per spec" claim from the commit-failure comment, and notes the long-frame single-inference behavior in `_on_audio_append`.
Replaces ad-hoc `isinstance` guards in `_on_session_update` / `_on_audio_append` with Pydantic models for `SessionUpdate` / `InputAudioBufferAppend` / `Commit` / `Clear`. `ValidationError` is caught in `_run_loop` and emitted as `invalid_value` with the joined `loc` as `param`, preserving the per-field error shape the existing reject tests expect (`session.audio`, `session.audio.input`, `session.audio.input.transcription`, `audio`).

Also closes the WebSocket with code 1009 ("message too big") after `buffer_overflow` so clients can distinguish session-resource exhaustion from a normal close.
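A minimal sketch of that pattern (the field shape and the error-carrying exception are illustrative; the real models mirror the OpenAI Realtime payloads):

```python
from pydantic import BaseModel, ValidationError


class ProtocolError(Exception):
    def __init__(self, code: str, param: str) -> None:
        super().__init__(f"{code}: {param}")
        self.code, self.param = code, param


class InputAudioBufferAppend(BaseModel):
    # Illustrative shape for the "input_audio_buffer.append" event.
    type: str
    audio: str  # base64-encoded PCM chunk


def parse_append(raw: dict) -> InputAudioBufferAppend:
    try:
        return InputAudioBufferAppend(**raw)
    except ValidationError as e:
        # Join the loc tuple into the param field, e.g. ("audio",) -> "audio",
        # so clients keep getting the same per-field invalid_value shape.
        param = ".".join(str(part) for part in e.errors()[0]["loc"])
        raise ProtocolError(code="invalid_value", param=param) from e
```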
Motivation
Implements M1 of the RFC in #22474.
PR #22089 shipped chunked streaming output for Qwen3-ASR via `POST /v1/audio/transcriptions?stream=true` (SSE over an HTTP upload), which assumes the entire audio file is known up-front. Real-time use cases (live captioning, voice assistants, meeting transcription) need the opposite direction: the server accepts audio as it arrives and pushes partial transcripts back as the speaker talks.

Modifications
New WebSocket endpoint

- `WS /v1/audio/transcriptions/stream` (registered in `http_server.py`)
- Client → server: `session.start` (JSON) → binary PCM16 frames → `session.end` (JSON)
- Server → client: `session.started` / `transcript.delta` (per word) / `transcript.final` / `error`
- `session.start` accepts optional `model` and `language` fields; any other format will be rejected at PCM-frame validation (`invalid_audio_format` if the frame byte length is not a multiple of 2, i.e. a truncated int16). A future milestone can add an `audio_format` negotiation field without breaking the existing wire format.
- `--asr-max-buffer-seconds` CLI flag (default 60 s, validated `> 0` at startup). If accumulated server-side audio exceeds the cap, the server sends a `buffer_overflow` error and closes the socket. Below the cap, the single-task coroutine alternates receive and inference, so while a chunk is inferring the client experiences standard TCP-level backpressure (`ws.send` blocks on a full socket buffer). No silent drop. `session.end` is therefore serialized after any in-flight chunk.
Note on the HTTP SSE path: switching `get_prefix_text()` from `confirmed_text` to `emitted_text` (see the `streaming_asr.py` row below) also incidentally fixes a latent prompt-prefix continuation issue in the HTTP SSE path from #22089, where `confirmed_text` could roll back mid-sentence and cause the model to re-emit from scratch on long English audio. Regression covered by the existing 8 HTTP tests.

Architecture
The WS transport uses an OO `RealtimeConnection` class (mirroring vLLM's `RealtimeConnection` in `vllm/entrypoints/openai/realtime/connection.py`): one instance per WebSocket connection, holding session state and exposing lifecycle methods. Wire messages are typed Pydantic models, so schema validation, type-checking, and JSON serialization flow from a single source of truth.

The HTTP SSE and WebSocket paths both route through a shared inference driver, keeping streaming state in `StreamingASRState` at the adapter layer rather than lifting it into the transport layer.
```mermaid
flowchart TB
    HTTP["HTTP<br/>(file upload)"]
    WS["WebSocket<br/>(live PCM frames)"]
    subgraph serving["OpenAIServingTranscription"]
        SSE["_generate_chunked_asr_stream"]
        WSH["handle_websocket<br/>(delegator)"]
    end
    PAC["process_asr_chunk<br/>(shared inference driver)"]
    subgraph state["StreamingASRState"]
        CT["confirmed_text<br/>chunk-local rollback, delta diff basis"]
        ET["emitted_text<br/>monotonic accumulator, prompt prefix source"]
        UF["update() / finalize()"]
    end
    HTTP --> SSE
    WS --> WSH
    SSE --> PAC
    WSH --> PAC
    PAC --> state
```

WebSocket session lifecycle
Inside `serving_transcription_websocket.py`, one coroutine alternates receive and inference. Each PCM batch of `chunk_size_bytes` triggers exactly one inference pass through the shared driver.

```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant H as RealtimeConnection
    participant I as process_asr_chunk
    participant S as StreamingASRState
    C->>H: session.start (JSON)
    H->>H: _init<br/>accept + adapter capability check
    H-->>C: session.started
    loop per chunk_size_bytes of new audio
        C->>H: binary PCM16 frames
        H->>H: _on_audio_frame<br/>accumulate into pcm_buffer
        H->>I: _run_inference(_pcm_to_wav(buffer))
        I->>S: update() → delta
        I-->>H: delta str
        H-->>C: transcript.delta (per word)
    end
    C->>H: session.end (JSON)
    H->>I: _run_inference(is_last=True)
    I->>S: finalize() → tail
    I-->>H: tail delta
    H-->>C: transcript.final
    H->>C: close socket
```

Files touched
| File | Change |
|---|---|
| `python/sglang/srt/entrypoints/http_server.py` | `WS /v1/audio/transcriptions/stream` route |
| `python/sglang/srt/entrypoints/openai/serving_transcription.py` | extract `process_asr_chunk` into `streaming_asr.py` so HTTP and WS share the inference driver; add `handle_websocket` delegator |
| `python/sglang/srt/entrypoints/openai/serving_transcription_websocket.py` | `RealtimeConnection` class encapsulating per-session state and lifecycle methods (`_init` / `_run_loop` / `_on_session_start` / `_on_session_end` / `_on_audio_frame` / `_run_inference`); Pydantic event models (`SessionStartEvent`, `SessionStartedEvent`, `TranscriptDeltaEvent`, `TranscriptFinalEvent`, `ErrorEvent`) as protocol schema source-of-truth; private `_pcm_to_wav` adapter for protocol-fixed PCM16 / 16 kHz / mono; bare-string error codes matching sglang/vLLM convention (no enum) |
| `python/sglang/srt/entrypoints/openai/streaming_asr.py` | `StreamingASRState` with an `emitted_text` accumulator used as the prompt prefix in `get_prefix_text()` (previously `confirmed_text`); also extract `process_asr_chunk` as the shared HTTP/WS inference driver and add a `normalize_whitespace` helper for batched-inference punctuation jitter |
| `python/sglang/srt/server_args.py` | `asr_max_buffer_seconds: int = 60` + CLI flag + `_handle_asr_validation()` startup check (rejects `<= 0` to fail fast on misconfiguration) |
| `test/manual/models/test_qwen3_asr.py` | `EXPECTED_TRANSCRIPTS` reference dict, `_wer` Levenshtein helper, WS assertions via `_assert_close_to_ref` (WER ≤ 0.15) |

Accuracy Tests
Manual end-to-end verification against HTTP non-streaming (one-shot) ground truth across 7 audio fixtures × 3 paths (HTTP JSON / HTTP SSE / WebSocket), all captured against a single live `sglang.launch_server` running `Qwen/Qwen3-ASR-0.6B` on a GH200:

| Fixture | One-shot HTTP JSON (ground truth) | HTTP SSE | WebSocket | Notes |
|---|---|---|---|---|
| EN conversational (15 s) | Oh yeah, yeah. He wasn't even that big when I started listening to him. But and his solo music didn't do overly well, but he did very well when he started writing for other people. | Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people. | Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people. | `Uh huh.` prefix hallucination (vocal-fry intro, chunked-inference artifact from #22089) + `But`→`but` casing + `.`→`,` mid-sentence drift. 37 WS deltas. SSE ≡ WS byte-for-byte. |
| ZH | 甚至出现交易几乎停滞的情况。 | 甚至出现交易几乎停滞的情况。 | 甚至出现交易几乎停滞的情况。 | Identical across all three paths. |
| EN speech | I have a dream that one day this nation will rise up and live out the true meaning of its creed. | I have a dream that one day this nation will rise up and live out the true meaning of its creed. | I have a dream that one day this nation will rise up and live out the true meaning of its creed. | Identical across all three paths. |
| EN prose | He hoped there would be stew for dinner—turnips and carrots and bruised potatoes and fat mutton pieces—to be ladled out in thick peppered flour-fatted sauce. | He hoped there would be stew for dinner: turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fatted sauce. | He hoped there would be stew for dinner: turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fatted sauce. | `—`→`:` + second `—` dropped. Normalized WER 0.000. 27 WS deltas. SSE ≡ WS byte-for-byte. |
| ES | y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumaje | Y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumaje. | Y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumaje. | Initial capitalization + final period added vs. one-shot. |
| HI | मिर्ची में कितने विभिन्न प्रजातियाँ हैं | मिर्ची में कितने विभिन्न प्रजातियाँ हैं | मिर्ची में कितने विभिन्न प्रजातियाँ हैं | Identical across all three paths. |
| EN short (3 s) | I know kung fu. | I know kung fu. | I know kung fu. | Identical across all three paths. |

Cross-path consistency: on every fixture the HTTP SSE output and the WebSocket output are byte-for-byte identical; you can read the SSE and WS columns and see this for yourself. This is the observable consequence of the two transports routing through the same `process_asr_chunk` driver over the same `StreamingASRState`; the transport layer introduces no additional drift beyond what chunked inference itself produces vs. one-shot.

WER threshold rationale: WS assertions use WER ≤ 0.15. This tolerates the chunked-inference boundary artifacts inherited from #22089 (e.g. the `Uh huh.` prefix on the EN clip) while still catching real regressions: the maximum observed WER across these 7 fixtures is 0.054, so the 0.15 threshold leaves ~3× headroom for flakiness without hiding regressions.

5 consecutive stability runs of the 19-test suite passed (all 19 tests green each run), with byte-identical WS transcripts across runs on a GH200. Median wall-clock 52.3 s per run once the model weights are cached.
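For reference, a minimal sketch of a `_wer`-style helper as described in the test notes (normalize, word-tokenize, Levenshtein distance over words); the exact normalization and the character-level CJK fallback in the real `_wer` may differ:

```python
import re


def _wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over normalized word tokens."""
    norm = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    r, h = norm(ref), norm(hyp)
    # Classic dynamic program: d[i][j] = edits to turn r[:i] into h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(r)][len(h)] / max(len(r), 1)


assert _wer("i know kung fu", "I know kung fu.") == 0.0
```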
Streaming verification: on the 15 s EN audio in realtime-pacing mode (the client sends PCM at 0.5 s/frame, i.e. wall-clock rate), ~30 out of 37 `transcript.delta` events arrive at the client before the client sends `session.end`, confirming true incremental server push (not batch-at-end).

M1 completion checklist (from RFC)
- [x] `WS /v1/audio/transcriptions/stream`
- [x] `session.start` / binary PCM / `session.end` ⇄ `session.started` / `transcript.delta` / `transcript.final` / `error`
- [x] Error codes: `invalid_json`, `invalid_payload`, `invalid_state`, `invalid_audio_format`, `unknown_message`, `buffer_overflow`, `unsupported_model`, `internal_error`
- [x] `--asr-max-buffer-seconds` backpressure
- [x] Shared `StreamingASRState`, `process_asr_chunk`, `TranscriptionAdapter`
- [ ] (`GenerateReqInput.stream=True`, forced-alignment timestamps): out of scope

Known limitations
Listed here so reviewers don't have to re-discover them:
"Uh huh."prefix hallucination because the first 2s chunk sees only vocal fry / silence, and the append-only delta protocol can't retract it later. WER ~0.054.transcript.deltaevent becausestr.split()can't word-tokenize them. Final transcript is still correct. TODO already inStreamingASRStatedocstring.mm_utils.py:_adjust_embedding_length. Upstream bug, unrelated to this PR.嗯哼。). Short-clip test uses 3s MP3 to avoid this.Test plan
Tests live under `test/manual/` because they require downloading the `Qwen/Qwen3-ASR-0.6B` checkpoint (~1.2 GB) and a GPU. They are runnable locally with a single command and complete in ~52 s on one H100.

Automated (single command)
```bash
cd /path/to/sglang
python test/manual/models/test_qwen3_asr.py
```

The file uses `popen_launch_server` to spin up its own `sglang.launch_server`, runs 19 tests, and tears the server down.

Test breakdown:
- … (`has_new_audio → is_last=True` path), 4 s chunk-boundary clip (an exact multiple of `chunk_size_bytes`, hitting `_on_session_end`'s `elif state.full_transcript` flush-tail branch without running another inference)

WS assertions use `_assert_close_to_ref` (WER ≤ 0.15) against `EXPECTED_TRANSCRIPTS` (a dict of canonical transcripts captured from one-shot non-streaming inference). `_wer` normalizes case/punctuation and falls back to character-level comparison for CJK.

Manual (single audio, step-by-step)
Launch the server:
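A plausible invocation (the exact flags are an assumption; the model and port match the curl examples in this section):

```bash
python -m sglang.launch_server --model-path Qwen/Qwen3-ASR-0.6B --port 30000
```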
HTTP non-streaming (baseline / ground truth):
```bash
curl -s http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav -F model=qwen3-asr | jq -r .text
```

HTTP SSE streaming:
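A sketch of the SSE call, assuming the `stream=true` query parameter from #22089 (`-N` disables curl's output buffering):

```bash
curl -sN "http://127.0.0.1:30000/v1/audio/transcriptions?stream=true" \
  -F file=@audio.wav -F model=qwen3-asr
```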
WebSocket streaming: save as `wsasr.py` and run `python wsasr.py audio.wav [language]`. The script sends PCM in 0.5 s frames at wall-clock realtime pacing while concurrently reading `transcript.delta` events; each printed line is prefixed with `t=<sec>` from session start, so you can see deltas arriving while audio is still uploading. A sketch of such a client is shown below.
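A minimal sketch of such a client against the M1 protocol described above (the `websockets` and `soundfile` dependencies, and the `text` field on delta events, are assumptions; the PR's reference client lives in `test/manual/models/test_qwen3_asr.py`):

```python
# wsasr.py: hypothetical reference client for the M1 wire protocol.
import asyncio, json, sys, time

import numpy as np
import soundfile as sf
import websockets

URL = "ws://127.0.0.1:30000/v1/audio/transcriptions/stream"
RATE, FRAME_SEC = 16000, 0.5  # protocol-fixed PCM16 mono 16 kHz, 0.5 s frames


async def main(path: str, language: str | None) -> None:
    audio, _ = sf.read(path, dtype="int16")  # assumes mono 16 kHz input audio
    pcm = np.asarray(audio, dtype=np.int16).tobytes()
    frame = int(RATE * FRAME_SEC) * 2  # bytes per 0.5 s of int16 samples
    async with websockets.connect(URL) as ws:
        start: dict = {"type": "session.start"}
        if language:
            start["language"] = language
        await ws.send(json.dumps(start))
        t0 = time.monotonic()

        async def read_events() -> None:
            async for msg in ws:
                ev = json.loads(msg)
                # Field name "text" is an assumption for the delta payload.
                print(f"t={time.monotonic() - t0:5.1f}s {ev['type']}: {ev.get('text', '')}")
                if ev["type"] in ("transcript.final", "error"):
                    return

        reader = asyncio.create_task(read_events())
        for off in range(0, len(pcm), frame):  # realtime pacing: one frame per 0.5 s
            await ws.send(pcm[off : off + frame])
            await asyncio.sleep(FRAME_SEC)
        await ws.send(json.dumps({"type": "session.end"}))
        await reader


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None))
```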
Error-path verification

Manually tested all 8 error codes against a running server:
`invalid_json`, `invalid_payload`, `invalid_state` × 3 variants, `invalid_audio_format`, `unknown_message`, `buffer_overflow`, `unsupported_model`, `internal_error`. All return the documented error event and close the socket.

Speed Tests and Profiling
No impact on inference speed: this PR is a thin WebSocket transport layer on top of unchanged chunked inference. Per-chunk latency is bound by the existing `chunk_size_sec` (2 s) + model inference time (~0.5–1.5 s on H100 for Qwen3-ASR-0.6B). No new CUDA kernels, no new memory patterns, no scheduler changes.

Checklist
- [x] Tests added (`test/manual/models/test_qwen3_asr.py`)
Related
TranscriptionAdapterrefactor, merged)