
[Feature] WebSocket streaming audio input for ASR #22848

Open
SammLSH wants to merge 10 commits into sgl-project:main from SammLSH:feat/ws-streaming-asr-input

Conversation

@SammLSH (Contributor) commented Apr 15, 2026

Motivation

Implements M1 of the RFC in #22474.

PR #22089 shipped chunked streaming output for Qwen3-ASR via POST /v1/audio/transcriptions?stream=true (SSE over an HTTP upload), which assumes the entire audio file is known up-front. Real-time use cases (live captioning, voice assistants, meeting transcription) need the opposite direction: the server accepts audio as it arrives and pushes partial transcripts back as the speaker talks.

Modifications

New WebSocket endpoint

  • Endpoint: WS /v1/audio/transcriptions/stream (registered in http_server.py)
  • Wire protocol (inspired by OpenAI Realtime API conventions; a simplified session/delta/final event model — see the example exchange after this list):
    • Client → session.start (JSON) → binary PCM16 frames → session.end (JSON)
    • Server → session.started / transcript.delta (per word) / transcript.final / error
  • Audio format (M1): fixed protocol: PCM16, 16 kHz, mono, little-endian. session.start accepts optional model and language fields; frames in any other format are rejected at PCM-frame validation (invalid_audio_format if a frame's byte length is not a multiple of 2, i.e. a truncated int16). A future milestone can add an audio_format negotiation field without breaking the existing wire format.
  • Backpressure: --asr-max-buffer-seconds CLI flag (default 60s, validated > 0 at startup). If accumulated server-side audio exceeds the cap, the server sends a buffer_overflow error and closes the socket. Below the cap, the single-task coroutine alternates receive and inference, so while a chunk is inferring the client experiences standard TCP-level backpressure (ws.send blocks on a full socket buffer). No silent drop.
  • Concurrency model: single-task — one coroutine alternates receive + inference; session.end is therefore serialized after any in-flight chunk.
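
A typical session on the wire (illustrative; session.start's model/language fields are documented above, every other field name shown here is an assumption):

{"type": "session.start", "model": "qwen3-asr", "language": "en"}    ← client, text frame
{"type": "session.started"}                                          ← server
<binary PCM16 frames, 16 kHz mono little-endian>                     ← client
{"type": "transcript.delta", "delta": "I"}                           ← server, one event per word
{"type": "transcript.delta", "delta": "know"}
{"type": "session.end"}                                              ← client
{"type": "transcript.final", "text": "I know kung fu."}              ← server, then socket close
{"type": "error", "code": "buffer_overflow", "message": "..."}       ← on any violation, then close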

Note on HTTP SSE path: switching get_prefix_text() from confirmed_text to emitted_text (see streaming_asr.py row below) also incidentally fixes a latent prompt-prefix continuation issue in the HTTP SSE path from #22089, where confirmed_text could roll back mid-sentence and cause the model to re-emit from scratch on long English audio. Regression covered by the existing 8 HTTP tests.
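
In sketch form (illustrative only; the real method lives in streaming_asr.py):

def get_prefix_text(self) -> str:
    # before: return self.confirmed_text  (chunk-local; can roll back mid-sentence)
    # after: emitted_text is the monotonic accumulator, so the prompt prefix
    # fed to the next chunk never retracts already-emitted words
    return self.emitted_text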

Architecture

The WS transport uses an OO RealtimeConnection class (mirroring vLLM's RealtimeConnection in vllm/entrypoints/openai/realtime/connection.py) — one instance per WebSocket connection, holding session state and exposing lifecycle methods. Wire messages are typed Pydantic models so schema validation, type-checking, and JSON serialization flow from a single source of truth.
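
A sketch of the event-model shape (class names from the files list below; fields beyond type/model/language/delta are assumptions, pydantic v2 API):

from typing import Literal, Optional

from pydantic import BaseModel


class SessionStartEvent(BaseModel):
    type: Literal["session.start"]
    model: Optional[str] = None      # optional per the M1 protocol
    language: Optional[str] = None


class TranscriptDeltaEvent(BaseModel):
    type: Literal["transcript.delta"] = "transcript.delta"
    delta: str


class ErrorEvent(BaseModel):
    type: Literal["error"] = "error"
    code: str                        # bare string, matching sglang/vLLM convention
    message: str


# validation on receive and serialization on send flow from the same classes:
event = SessionStartEvent.model_validate_json('{"type": "session.start", "language": "en"}')
wire = TranscriptDeltaEvent(delta="kung").model_dump_json()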

The HTTP SSE and WebSocket paths both route through a shared inference driver, keeping streaming state in StreamingASRState at the adapter layer rather than lifting it into the transport layer.

flowchart TB
    HTTP["HTTP<br/>(file upload)"]
    WS["WebSocket<br/>(live PCM frames)"]

    subgraph serving["OpenAIServingTranscription"]
        SSE["_generate_chunked_asr_stream"]
        WSH["handle_websocket<br/>(delegator)"]
    end

    PAC["process_asr_chunk<br/>(shared inference driver)"]

    subgraph state["StreamingASRState"]
        CT["confirmed_text<br/>chunk-local rollback, delta diff basis"]
        ET["emitted_text<br/>monotonic accumulator, prompt prefix source"]
        UF["update() / finalize()"]
    end

    HTTP --> SSE
    WS --> WSH
    SSE --> PAC
    WSH --> PAC
    PAC --> state

WebSocket session lifecycle

Inside serving_transcription_websocket.py, one coroutine alternates receive and inference. Each PCM batch of chunk_size_bytes triggers exactly one inference pass through the shared driver.

sequenceDiagram
    autonumber
    participant C as Client
    participant H as RealtimeConnection
    participant I as process_asr_chunk
    participant S as StreamingASRState

    C->>H: session.start (JSON)
    H->>H: _init<br/>accept + adapter capability check
    H-->>C: session.started

    loop per chunk_size_bytes of new audio
        C->>H: binary PCM16 frames
        H->>H: _on_audio_frame<br/>accumulate into pcm_buffer
        H->>I: _run_inference(_pcm_to_wav(buffer))
        I->>S: update() → delta
        I-->>H: delta str
        H-->>C: transcript.delta (per word)
    end

    C->>H: session.end (JSON)
    H->>I: _run_inference(is_last=True)
    I->>S: finalize() → tail
    I-->>H: tail delta
    H-->>C: transcript.final
    H->>C: close socket
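
A minimal sketch of that single-task loop (Starlette/FastAPI WebSocket API; method and event names follow the diagram, the bodies and exact signatures are assumptions):

import json

from fastapi import WebSocket


async def run_inference(pcm: bytes, is_last: bool = False) -> str:
    ...  # stand-in for _pcm_to_wav + the shared process_asr_chunk driver


async def run_loop(ws: WebSocket, chunk_size_bytes: int, max_buffer_bytes: int) -> None:
    pcm_buffer = bytearray()
    consumed = 0  # bytes already pushed through inference
    while True:
        message = await ws.receive()  # receive and inference alternate on one task
        if message["type"] == "websocket.disconnect":
            return
        if message.get("bytes") is not None:
            pcm_buffer.extend(message["bytes"])
            if len(pcm_buffer) > max_buffer_bytes:  # --asr-max-buffer-seconds cap
                await ws.send_json({"type": "error", "code": "buffer_overflow"})
                await ws.close()
                return
            # exactly one inference pass per chunk_size_bytes of new audio
            while len(pcm_buffer) - consumed >= chunk_size_bytes:
                consumed += chunk_size_bytes
                delta = await run_inference(bytes(pcm_buffer[:consumed]))
                for word in delta.split():
                    await ws.send_json({"type": "transcript.delta", "delta": word})
        elif json.loads(message["text"]).get("type") == "session.end":
            tail = await run_inference(bytes(pcm_buffer), is_last=True)
            await ws.send_json({"type": "transcript.final", "text": tail})
            await ws.close()
            return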

Files touched

  • python/sglang/srt/entrypoints/http_server.py: register the WS /v1/audio/transcriptions/stream route.
  • python/sglang/srt/entrypoints/openai/serving_transcription.py: extract process_asr_chunk into streaming_asr.py so HTTP and WS share the inference driver; add the handle_websocket delegator.
  • python/sglang/srt/entrypoints/openai/serving_transcription_websocket.py: NEW. WS transport: RealtimeConnection class encapsulating per-session state and lifecycle methods (_init / _run_loop / _on_session_start / _on_session_end / _on_audio_frame / _run_inference); Pydantic event models (SessionStartEvent, SessionStartedEvent, TranscriptDeltaEvent, TranscriptFinalEvent, ErrorEvent) as the protocol's schema source of truth; private _pcm_to_wav adapter for the protocol-fixed PCM16/16 kHz/mono format; bare-string error codes matching sglang/vLLM convention (no enum).
  • python/sglang/srt/entrypoints/openai/streaming_asr.py: extend StreamingASRState with an emitted_text accumulator used as the prompt prefix in get_prefix_text() (previously confirmed_text); extract process_asr_chunk as the shared HTTP/WS inference driver; add a normalize_whitespace helper for batched-inference punctuation jitter.
  • python/sglang/srt/server_args.py: asr_max_buffer_seconds: int = 60, the CLI flag, and a _handle_asr_validation() startup check (rejects ≤ 0 to fail fast on misconfiguration).
  • test/manual/models/test_qwen3_asr.py: 19 tests total (8 HTTP non-stream + 11 WebSocket); EXPECTED_TRANSCRIPTS reference dict, _wer Levenshtein helper, WS assertions via _assert_close_to_ref (WER ≤ 0.15).

Accuracy Tests

Manual end-to-end verification against HTTP non-streaming (one-shot) ground truth across 7 audio fixtures × 3 paths (HTTP JSON / HTTP SSE / WebSocket), all captured against a single live sglang.launch_server running Qwen/Qwen3-ASR-0.6B on a GH200:

  • EN (Qwen podcast), 15.05s
    HTTP non-stream (truth): Oh yeah, yeah. He wasn't even that big when I started listening to him. But and his solo music didn't do overly well, but he did very well when he started writing for other people.
    HTTP SSE: Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.
    WebSocket: Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.
    Notes: WER 0.054 vs truth. "Uh huh." prefix hallucination (vocal-fry intro, chunked-inference artifact from #22089), But→but casing, and .→, mid-sentence drift. 37 WS deltas. SSE ≡ WS byte-for-byte.

  • ZH (Qwen news), 4.20s
    All 3 paths byte-identical: 甚至出现交易几乎停滞的情况。
    Notes: 1 WS delta (CJK: no space-splitting).

  • MLK speech, 13.00s
    All 3 paths byte-identical: I have a dream that one day this nation will rise up and live out the true meaning of its creed.
    Notes: 21 WS deltas.

  • LibriSpeech dummy, 10.44s
    HTTP non-stream (truth): He hoped there would be stew for dinner—turnips and carrots and bruised potatoes and fat mutton pieces—to be ladled out in thick peppered flour-fatted sauce.
    HTTP SSE = WebSocket: He hoped there would be stew for dinner: turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fatted sauce.
    Notes: Word-level identical to truth; punctuation-only drift (the first dash became ":" and the second was dropped). Normalized WER 0.000. 27 WS deltas. SSE ≡ WS byte-for-byte.

  • Spanish (LibriSpeech-style), 6.58s
    HTTP non-stream (truth): y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumaje
    HTTP SSE = WebSocket: Y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumaje.
    Notes: Case + trailing period only. Normalized WER 0.000. 14 WS deltas. SSE ≡ WS byte-for-byte.

  • Hindi, 4.13s
    All 3 paths byte-identical: मिर्ची में कितने विभिन्न प्रजातियाँ हैं
    Notes: 6 WS deltas.

  • MP3 stereo ("I know kung fu"), 3.98s
    All 3 paths byte-identical: I know kung fu.
    Notes: 4 WS deltas.

Cross-path consistency: on every fixture the HTTP-SSE output and the WebSocket output are byte-for-byte identical, as the SSE and WebSocket transcripts above show. This is the observable consequence of the two transports routing through the same process_asr_chunk driver over the same StreamingASRState; the transport layer introduces no additional drift beyond what chunked inference itself produces vs. one-shot.

WER threshold rationale: WS assertions use WER ≤ 0.15. This tolerates the chunked-inference boundary artifacts inherited from #22089 (e.g. the Uh huh. prefix on the EN clip) while still catching real regressions — the maximum observed WER across these 7 fixtures is 0.054, so the 0.15 threshold leaves ~3× headroom for flakiness without hiding regressions.

5 consecutive stability runs of the 19-test suite passed (all 19 tests green each run), with byte-identical WS transcripts across runs on a GH200. Median wall-clock 52.3s per run once the model weights are cached.

Streaming verification: on the 15s EN audio in realtime-pacing mode (client sends PCM at 0.5s/frame = wall-clock rate), ~30 out of 37 transcript.delta events arrive at the client before the client sends session.end, confirming true incremental server push (not batch-at-end).

M1 completion checklist (from RFC)

  • ✅ WebSocket endpoint /v1/audio/transcriptions/stream
  • ✅ Protocol: session.start / binary PCM / session.end → session.started / transcript.delta / transcript.final / error
  • ✅ 8 error codes: invalid_json, invalid_payload, invalid_state, invalid_audio_format, unknown_message, buffer_overflow, unsupported_model, internal_error
  • ✅ Single-task concurrency, --asr-max-buffer-seconds backpressure
  • ✅ No scheduler or engine changes
  • ✅ Reuse StreamingASRState, process_asr_chunk, TranscriptionAdapter
  • M2 (cross-chunk RadixCache prefill reuse): out of scope for this PR
  • M3 (token-level streaming within chunks via GenerateReqInput.stream=True, forced-alignment timestamps): out of scope

Known limitations

Listed here so reviewers don't have to re-discover them:

  • EN 15s audio has an "Uh huh." prefix hallucination: the first 2s chunk sees only vocal fry / silence, and the append-only delta protocol can't retract it later. WER ~0.054. Origin: inherited from #22089 chunked inference. Fix: not at the serving layer; would need either (a) a skip-emit-for-noise heuristic on chunk 0, or (b) token-level timestamps (M3).
  • CJK languages (Chinese, Japanese, …) produce a single transcript.delta event because str.split() can't word-tokenize them; the final transcript is still correct. Origin: TODO already in the StreamingASRState docstring, inherited from #22089. Fix: M3, token-level overlap instead of word-level.
  • Boundary-aligned real word repetition (theoretical): if the audio contains a legitimately repeated phrase that aligns exactly with a 2s chunk boundary, the current merge algorithm could skip one repetition. None of the 7 test fixtures trigger this; it would show up on songs / chants / a full 8-verse MLK speech. Origin: trade-off of the current merge algorithm. Fix: M3, forced-alignment timestamp-based merge; this is the fundamental reason the RFC files M3 as a separate milestone.
  • Silence (all-zero PCM) crashes Qwen3-ASR in mm_utils.py:_adjust_embedding_length. Origin: upstream sglang bug, unrelated to this PR. Tracked in #XXXXX.
  • <2s of real speech triggers Qwen3-ASR short-context hallucinations (a 1s EN clip returns 嗯哼。). The short-clip test uses a 3s MP3 to avoid this. Origin: model limitation. Fix: not at the serving layer.

Test plan

Tests live under test/manual/ because they require downloading the Qwen/Qwen3-ASR-0.6B checkpoint (~1.2GB) and a GPU. They are runnable locally with a single command and complete in ~52s on one H100.

Automated (single command)

cd /path/to/sglang
python test/manual/models/test_qwen3_asr.py

The file uses popen_launch_server to spin up its own sglang.launch_server, runs 19 tests, and tears the server down.

Test breakdown:

  • HTTP non-streaming (ground truth): 8 tests. EN, ZH, MLK, LibriSpeech, Spanish, Hindi, MP3 stereo, EN × 3 consistency.
  • WebSocket streaming, happy paths: 7 tests. EN / ZH / Hindi / Spanish / MLK / LibriSpeech / MP3 (fast-mode PCM).
  • WebSocket streaming, edges: 4 tests. Realtime-pacing EN; 3 concurrent sessions on the same audio (state isolation + determinism); 3s short clip (has_new_audio → is_last=True path); 4s chunk-boundary clip (exact multiple of chunk_size_bytes → hits _on_session_end's elif state.full_transcript flush-tail branch without running another inference).

WS assertions use _assert_close_to_ref(WER ≤ 0.15) against EXPECTED_TRANSCRIPTS (a dict of canonical transcripts captured from one-shot non-streaming inference). _wer normalizes case/punctuation and falls back to character-level comparison for CJK.
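
For reference, a minimal word-level WER sketch in the spirit of _wer (the normalization details are simplified and the CJK character-level fallback is omitted here; this is not the test file's exact code):

import re


def wer(ref: str, hyp: str) -> float:
    # normalize case and strip punctuation before comparing word sequences
    def norm(s: str) -> list:
        return re.sub(r"[^\w\s]", "", s.lower()).split()

    r, h = norm(ref), norm(hyp)
    # word-level Levenshtein distance, rolling single-row DP
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
            prev = cur
    return dp[-1] / max(len(r), 1)


assert wer("I know kung fu.", "i know kung fu") == 0.0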

Manual (single audio, step-by-step)

Launch the server:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-ASR-0.6B \
  --served-model-name qwen3-asr \
  --trust-remote-code \
  --host 127.0.0.1 --port 30000

HTTP non-streaming (baseline / ground truth):

curl -s http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav -F model=qwen3-asr | jq -r .text

HTTP SSE streaming:

curl -Ns http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav -F model=qwen3-asr -F stream=true

WebSocket streaming — save as wsasr.py and run python wsasr.py audio.wav [language]. Sends PCM in 0.5s frames at wall-clock realtime pacing while concurrently reading transcript.delta events; each line is prefixed with t=<sec> from session start so you can see deltas arriving while audio is still uploading.

import asyncio
import json
import sys
import time

import numpy as np
import soundfile as sf
import websockets


async def main(path, lang=None):
    data, sr = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)
    if sr != 16000:
        n = int(len(data) / sr * 16000)
        data = np.interp(np.linspace(0, len(data) - 1, n), np.arange(len(data)), data)
        sr = 16000
    pcm = (data * 32767).astype(np.int16).tobytes()

    url = "ws://127.0.0.1:30000/v1/audio/transcriptions/stream"
    async with websockets.connect(url) as ws:
        start = {"type": "session.start"}
        if lang:
            start["language"] = lang

        t0 = time.perf_counter()
        await ws.send(json.dumps(start))

        def stamp():
            return time.perf_counter() - t0

        ack_raw = await ws.recv()
        ack = json.loads(ack_raw)
        print(f"[+{stamp():5.2f}] << {ack_raw}")
        if ack.get("type") != "session.started":
            print(f"[+{stamp():5.2f}] !! expected session.started, got {ack.get('type')}; aborting")
            return

        async def receive_loop():
            try:
                async for raw in ws:
                    print(f"[+{stamp():5.2f}] << {raw}")
                    msg = json.loads(raw)
                    if msg["type"] == "transcript.final":
                        return
                    if msg["type"] == "error":
                        print(f"[+{stamp():5.2f}] !! server error, closing")
                        return
            except websockets.ConnectionClosed as e:
                print(f"[+{stamp():5.2f}] !! connection closed: {e}")

        receiver = asyncio.create_task(receive_loop())

        chunk = int(0.5 * sr) * 2  # 0.5s of int16 PCM
        try:
            for i in range(0, len(pcm), chunk):
                if receiver.done():
                    print(f"[+{stamp():5.2f}] !! receiver ended early, stopping send")
                    break
                await ws.send(pcm[i : i + chunk])
                await asyncio.sleep(0.5)  # realtime pacing; drop for fast mode

            if not receiver.done():
                print(f"[+{stamp():5.2f}] >> session.end")
                await ws.send(json.dumps({"type": "session.end"}))
        except websockets.ConnectionClosed as e:
            print(f"[+{stamp():5.2f}] !! send failed, connection closed: {e}")

        await receiver


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None))

Error-path verification

Manually tested all 8 error codes against a running server: invalid_json, invalid_payload, invalid_state × 3 variants, invalid_audio_format, unknown_message, buffer_overflow, unsupported_model, internal_error. All return the documented error event and close the socket.

Speed Tests and Profiling

No impact on inference speed — this PR is a thin WebSocket transport layer on top of unchanged chunked inference. Per-chunk latency is bound by the existing chunk_size_sec (2s) + model inference time (~0.5–1.5s on H100 for Qwen3-ASR-0.6B). No new CUDA kernels, no new memory patterns, no scheduler changes.

Checklist

  • Format code with pre-commit (black, isort, ruff, codespell all passing)
  • Run and add unit tests (19 tests in test/manual/models/test_qwen3_asr.py)
  • Documentation: wire protocol documented in the Modifications section above
  • Provide accuracy benchmark results — see "Accuracy Tests" section above
  • Follow the SGLang code style guidance

Review and Merge Process

  1. Ping Merge Oncalls to start the process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments.
  4. After green CI and required approvals, merge.

Related


@SammLSH force-pushed the feat/ws-streaming-asr-input branch from 236cea0 to 2f6c8b5 on April 15, 2026

SammLSH and others added 3 commits April 27, 2026 01:50
- Wrap session lifecycle in RealtimeConnection class; remove free-standing
  handler functions and the RealtimeTranscriptionSession dataclass. State
  and methods now live on a single per-connection instance, mirroring
  vLLM's RealtimeConnection structure.
- Replace ad-hoc dict messages with Pydantic event models (SessionStartEvent,
  SessionStartedEvent, TranscriptDeltaEvent, TranscriptFinalEvent,
  ErrorEvent) for schema-validated send/receive. Pydantic now catches
  invalid_payload cases (wrong types, missing fields) automatically.
- Drop RealtimeMessageType and RealtimeErrorCode enums in favor of bare
  strings, matching sglang's serving_base.py and vLLM's realtime/
  connection.py conventions.
- Module docstring now shows concrete wire examples and explicitly notes
  the protocol is sglang-specific (does not align with OpenAI Realtime
  / vLLM /v1/realtime).
- Add note in _pcm_to_wav about cumulative re-encoding cost; deferred
  to M2 (RadixCache prefix caching).
- Update test docstring to drop reference to renamed internal helper.

Reject non-positive values at startup so misconfiguration fails fast
instead of producing immediately-closing sessions (value == 0) or
disabling the backpressure cap entirely (value < 0, comparison never
trips). Validation runs before the dummy-model short-circuit so it
applies regardless of which model is being served.
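
A sketch of that check (flag and helper names as in server_args.py; the exact wiring is an assumption):

def _handle_asr_validation(server_args) -> None:
    # 0 would produce immediately-closing sessions; a negative cap would never
    # trip the buffer_overflow comparison, silently disabling backpressure
    if server_args.asr_max_buffer_seconds <= 0:
        raise ValueError("--asr-max-buffer-seconds must be > 0")
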
Comment on lines +274 to +276
TranscriptionRequest(language=event.language)
if event.language
else TranscriptionRequest()
Collaborator
Suggested change:

- TranscriptionRequest(language=event.language)
- if event.language
- else TranscriptionRequest()
+ TranscriptionRequest(language=event.language)

Comment on lines +198 to +199
adapter = self.serving._adapter
if not adapter.supports_chunked_streaming:
Collaborator

_adapter is a private member of OpenAIServingTranscription and we are now accessing it across classes.
I'm considering two alternatives:

  1. expose a public adapter property or create delegation to re-export some properties and methods of _adapter
  2. handle_realtime_transcription accepts tokenizer_manager and _adapter instead of OpenAIServingTranscription (this also gets rid of circular imports and TYPE_CHECKING.)

Contributor Author

Option 2 sounds good to me. RealtimeConnection only uses tokenizer_manager and adapter from serving, and passing them in also lets us drop the TYPE_CHECKING import.

Comment on lines +123 to +124
tokenizer_manager,
adapter,
Collaborator

please add type hints

Comment on lines +62 to +75
def _pcm_to_wav(pcm_buffer: bytes) -> bytes:
    """Wrap raw PCM16 mono 16 kHz bytes in a WAV container so
    ``soundfile.read`` (called by the multimodal processor) can decode it.

    Note: callers re-encode the entire cumulative buffer per chunk (M1
    constraint). Cost is bounded by ``--asr-max-buffer-seconds``; M2 plans
    RadixCache prefix caching to remove this overhead.
    """
    if not pcm_buffer:
        raise ValueError("pcm_buffer is empty")
    samples = np.frombuffer(pcm_buffer, dtype=np.int16)
    buf = io.BytesIO()
    sf.write(buf, samples, _SAMPLE_RATE, format="WAV")
    return buf.getvalue()
Collaborator

here PCM is converted to WAV, but later the multimodal processor will need to decode WAV to a float32 ndarray. Potentially we can make load_audio accept a pre-decoded ndarray to skip the "round trip".
(maybe a follow-up PR)
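
A sketch of the direct path (a hypothetical load_audio overload; the int16→float32 scaling is the standard soundfile convention):

import numpy as np

pcm_buffer = b"\x00\x00" * 16000  # 1s of 16 kHz silence, for illustration
# instead of PCM16 → WAV container → soundfile decode back to float32:
samples = np.frombuffer(pcm_buffer, dtype=np.int16).astype(np.float32) / 32768.0
# hypothetical: load_audio(samples, sample_rate=16000) would skip the round trip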

Contributor Author

will do the ndarray path as a follow-up.

"sampling parameters if available. Default is 'model'.",
)
parser.add_argument(
"--asr-max-buffer-seconds",
Collaborator

This is a per-session cap; globally we also need an --asr-max-concurrent-sessions.

Contributor Author

will add it in the next commit

Comment on lines +21 to +24
This protocol is sglang-specific; it does not align with OpenAI's Realtime
API spec (vLLM's ``/v1/realtime``). Clients written against OpenAI Realtime
will not work as-is. See ``test/manual/models/test_qwen3_asr.py`` for a
reference client and the Pydantic event models below for full schema.
Collaborator

I think it would be nice to have an OpenAI-compatible protocol from now on, because it will influence some future designs.
cc. @mickqian

Contributor Author

Agreed, I can update it in the next commit.

SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from
the M1 session.start/binary-PCM protocol to OpenAI's Realtime
transcription wire format. The shared inference driver is untouched,
so HTTP SSE and WS still produce byte-identical transcripts; this is
purely a transport rewrite.

sglang deviations from the spec live in the module docstring:
sample_rate is a sglang extension accepting 16/24/48 kHz with internal
resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and
noise_reduction must be null (no server-side VAD), include[] is
dropped, model is echo-only.

Addresses #22848 review comments: #1 (decouple from OpenAIServingTranscription),
#4 (type hints), #5 (--asr-max-concurrent-sessions, default 32).
#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's
input contract.
@SammLSH force-pushed the feat/ws-streaming-asr-input branch 2 times, most recently from 0725add to 3167135 on May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from
the M1 session.start/binary-PCM protocol to OpenAI's Realtime
transcription wire format. Add --asr-max-concurrent-sessions (default
32) for the global session cap.

sglang deviations from the spec are documented in the module docstring:
sample_rate accepts 16/24/48 kHz with internal resample to 16 kHz,
turn_detection and noise_reduction must be null, include[] is dropped,
model is echo-only.
@SammLSH force-pushed the feat/ws-streaming-asr-input branch from 3167135 to 296fff3 on May 4, 2026
SammLSH added 3 commits May 5, 2026 04:00
5 PR-added HTTP fixture tests now use _assert_close_to_ref(max_wer=0.05)
instead of len > 0. New test_websocket_rejects_unsupported_sample_rate
covers the 22050 reject path. Spanish WS keeps native 48 kHz so the
server-side resample runs (was bypassed by client pre-resample).

Wraps the WS receive task in asyncio.wait_for(timeout=60) so a stuck
inference fails fast with the deltas seen so far. Trims docstrings,
section banners, and assertIn scaffolding.

20/20 pass, ~53s.
isinstance guards now reject non-dict session.audio, session.audio.input,
and session.audio.input.transcription with invalid_value instead of
crashing into the unrecoverable handler. Tests added for each.

buffer_overflow now reports error_type=server_error (was
rate_limit_exceeded, which would have misled clients into backoff when
the right action is reconnect).

Extracted _format_error_event so _send_error and the pre-connection
too_many_sessions path share one error-event schema. Also makes
_send_error's param/error_type keyword-only, drops a stale "per spec"
claim from the commit-failure comment, and notes the long-frame
single-inference behavior in _on_audio_append.
Replaces ad-hoc isinstance guards in _on_session_update / _on_audio_append
with Pydantic models for SessionUpdate / InputAudioBufferAppend / Commit /
Clear. ValidationError is caught in _run_loop and emitted as invalid_value
with the joined loc as param, preserving the per-field error shape the
existing reject tests expect (session.audio, session.audio.input,
session.audio.input.transcription, audio).

Also closes the WebSocket with code 1009 ("message too big") after
buffer_overflow so clients can distinguish session-resource exhaustion
from a normal close.
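
In sketch form, the ValidationError → invalid_value mapping (pydantic v2 API; the event model's fields beyond type/audio are assumptions):

from pydantic import BaseModel, ValidationError


class InputAudioBufferAppend(BaseModel):
    type: str
    audio: str  # base64-encoded PCM payload


try:
    InputAudioBufferAppend.model_validate({"type": "input_audio_buffer.append"})
except ValidationError as exc:
    # the first failing field's loc, joined as a dotted path, becomes param
    param = ".".join(str(part) for part in exc.errors()[0]["loc"])  # -> "audio"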