Transcribe (Realtime / WebSocket)
Real-time speech-to-text over a persistent WebSocket. The fit-for-purpose path for live captioning, voice agents, and any flow where you need partial transcripts while the user is still speaking.
Use POST /waves/v1/stt/ when you have a complete file: single request, single response, less plumbing.
Today only ?model=pulse is supported on the streaming endpoint. ?model=pulse-pro is rejected with 400 before the WebSocket upgrade because Pulse Pro has no streaming worker; use POST /waves/v1/stt/?model=pulse-pro (HTTP) instead.
Connect to wss://api.smallest.ai/waves/v1/stt/live with Authorization: Bearer <key> and the session params (model, language, sample_rate, encoding, etc.) as a query string.
Send audio (matching encoding) over the socket as binary frames.
Receive transcription messages with is_final: false partial results as audio streams and is_final: true when an utterance closes.
{"type":"finalize"}: turn-boundary signal. Flushes the current audio buffer, emits one is_final: true transcript for that turn, and keeps the WebSocket open for the next user turn. Use this once per turn in a multi-turn voice agent.
{"type":"close_stream"}: session-end signal. Flushes remaining audio, emits the terminal is_final: true + is_last: true transcript, then closes the WebSocket. Use this once, at the actual end of the session (call end, app shutdown, or after a single-shot transcription buffer is fully streamed).
A multi-turn voice agent typically fires many finalize messages and exactly one close_stream. A one-off transcription of a fixed audio buffer fires only close_stream.
Python — multi-turn voice agent (recommended for Voice AI)
Send finalize per user turn so the WebSocket stays open across the whole call. You pay the connection cost once, not per turn.
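A minimal sketch of the multi-turn flow, not a reference client. It assumes the third-party websockets package (recent releases name the header keyword additional_headers; older ones call it extra_headers), 16 kHz linear16 audio, language=en, and a top-level transcript field on each event. The helper names (session_url, transcribe_turn, end_session, run_call) are illustrative and not part of the API.

```python
import json
from urllib.parse import urlencode

STT_LIVE_URL = "wss://api.smallest.ai/waves/v1/stt/live"


def session_url(**params) -> str:
    """Attach the session params (model, language, ...) as a query string."""
    return f"{STT_LIVE_URL}?{urlencode(params)}"


async def transcribe_turn(ws, chunks) -> str:
    """Stream one user turn, then ask the server to finalize it."""
    for chunk in chunks:
        await ws.send(chunk)  # bytes are sent as a binary frame
    await ws.send(json.dumps({"type": "finalize"}))
    while True:
        event = json.loads(await ws.recv())
        if event.get("is_final"):
            # Assumption: the text lives in a top-level "transcript" field.
            return event.get("transcript", "")


async def end_session(ws) -> None:
    """Send close_stream once and wait for the terminal is_last event."""
    await ws.send(json.dumps({"type": "close_stream"}))
    while True:
        event = json.loads(await ws.recv())
        if event.get("is_last"):
            return


async def run_call(api_key: str, turns) -> list:
    """turns: an iterable of user turns, each an iterable of audio chunks."""
    import websockets  # third-party: pip install websockets

    url = session_url(model="pulse", language="en",
                      sample_rate=16000, encoding="linear16")
    headers = {"Authorization": f"Bearer {api_key}"}
    transcripts = []
    # One connection for the whole call; many finalize, one close_stream.
    async with websockets.connect(url, additional_headers=headers) as ws:
        for chunks in turns:
            transcripts.append(await transcribe_turn(ws, chunks))
        await end_session(ws)
    return transcripts
```

Note that transcribe_turn returns at the first is_final event it sees. If the server also finalizes automatically mid-turn, a production client would more likely read events in a separate task and match them to turns itself.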
Python — single-shot transcription
For one-off transcription of a complete audio buffer (file, single utterance) where no further audio is coming, send close_stream directly after the last chunk.
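A sketch of the single-shot flow under a few assumptions: ws is an already-open connection exposing async send and recv (for example one returned by the websockets package), the audio is mono, and each event carries a top-level transcript field. The names chunk_bytes and transcribe_buffer are hypothetical helpers, and the 100 ms chunk duration is an arbitrary choice rather than a documented requirement.

```python
import json

# Bytes per sample for the uncompressed encodings (mono audio assumed).
BYTES_PER_SAMPLE = {"linear16": 2, "linear32": 4, "alaw": 1, "mulaw": 1}


def chunk_bytes(encoding: str, sample_rate: int, chunk_ms: int = 100) -> int:
    """Size in bytes of chunk_ms milliseconds of mono audio."""
    return sample_rate * BYTES_PER_SAMPLE[encoding] * chunk_ms // 1000


async def transcribe_buffer(ws, audio: bytes, chunk_size: int) -> str:
    """Stream a complete buffer, send close_stream, collect the finals."""
    for start in range(0, len(audio), chunk_size):
        await ws.send(audio[start:start + chunk_size])  # binary frame
    # No finalize here: close_stream flushes and ends the session.
    await ws.send(json.dumps({"type": "close_stream"}))
    finals = []
    while True:
        event = json.loads(await ws.recv())
        if event.get("is_final") and event.get("transcript"):
            finals.append(event["transcript"])
        if event.get("is_last"):
            return " ".join(finals)
```

For 16 kHz linear16 audio, chunk_bytes("linear16", 16000) gives 3200 bytes per 100 ms chunk, which can be passed straight in as chunk_size.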
model is required. Missing or invalid values return 400 before the WebSocket upgrades.
Match sample_rate to your audio. The server does not resample; mismatched rates produce garbage transcripts.
Connections to wss://api.smallest.ai/waves/v1/pulse/get_text continue to work alongside this unified path.
Header authentication of the form Bearer <token>.
Selects the ASR model. Today only pulse is supported on the streaming endpoint. pulse-pro returns 400 before the WebSocket upgrade because Pulse Pro has no streaming worker yet; for Pulse Pro use the HTTP endpoint (POST /waves/v1/stt/?model=pulse-pro).
Language code for transcription. Set explicitly to the known language for best accuracy. See the Pulse model card for the full 38-language list and the multi-* aggregators.
Include word-level timestamps in transcription events.
When false, strips end-of-sentence punctuation (., ,, ?, !)
from the final transcript, words[].word, and
utterances[].transcript. Does not affect casing — use capitalize
for that.
Overridden to false when format=false.
When false, lowercases the entire transcript output (final
transcript, words[].word, and utterances[].transcript).
Does not affect punctuation — use punctuate for that.
Overridden to false when format=false.
Enable Inverse Text Normalization to convert spoken-form entities
(numbers, dates, currencies, phone numbers, etc.) into written form
in finalized transcripts. Example: five five five one two three four
becomes 5551234.
Overridden to false when format=false.
When false, disables automatic word-count-based finalization.
Use with itn_normalize=true for agentic pipelines where you
control finalization explicitly via {"type":"finalize"} messages.
Redact payment card information (credit-card numbers, CVV, account
numbers, etc.). Replaces matches with [ACCOUNTNUMBER_*] tokens.
Use alongside redact_pii=true for full PCI-compliant transcript
handling in voice-agent flows.
Include sentence-level timing in the transcription event under a
new utterances array. Each entry carries text, start, end,
and (when diarization is on) speaker. Combine with
word_timestamps=true to get both word and sentence boundaries.
When true, every transcription event also carries a cumulative
full_transcript field containing the concatenated final
transcripts of all utterances so far in this session. Useful for
multi-turn voice agents that want the running transcript without
tracking turns client-side.
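For comparison, a sketch of what client-side tracking might look like when this option is off. It assumes each event carries a top-level transcript field and that joining finals with a single space is acceptable; RunningTranscript is a hypothetical helper, and the separator the server uses for full_transcript is not stated here.

```python
class RunningTranscript:
    """Tracks final transcripts client-side; defers to the server if it can."""

    def __init__(self):
        self.finals = []   # one entry per finalized utterance
        self.partial = ""  # latest is_final: false text, if any

    def update(self, event: dict) -> str:
        text = event.get("transcript", "")
        if event.get("is_final"):
            if text:
                self.finals.append(text)
            self.partial = ""
        else:
            self.partial = text
        # Prefer the server's cumulative field when it is present.
        if "full_transcript" in event:
            return event["full_transcript"]
        return " ".join(self.finals)
```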
Flush the current audio buffer, run ITN over the accumulated utterance, and emit one is_final transcript. The WebSocket stays open and accepts audio for the next user turn. Send this once per turn in any multi-turn flow (voice agents, conversational STT).
Flush any remaining buffered audio, emit the terminal is_final + is_last transcript, then close the WebSocket. Send this once at the end of the session — end of call, app shutdown, or after the entire buffer of a single-shot transcription is streamed.
Audio encoding of the bytes you stream. The server uses this to decode incoming frames; set it to match what your client is sending.
linear16, linear32: raw PCM (16-bit and 32-bit). Pair with the matching sample_rate.
alaw, mulaw: 8 kHz telephony codecs. Pair with sample_rate=8000.
opus, ogg_opus: Opus compressed audio (raw and Ogg container).
Master formatting switch for transcript responses. When false,
forces punctuate=false, capitalize=false, and also disables
Inverse Text Normalization (ITN) so it cannot silently
reintroduce punctuation or casing.
When true, the punctuate and capitalize params take effect
independently. Leave format=true and use those two to fine-tune.
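The override rule can be written out as a small function. This is only an illustration of the precedence described above, not server code; effective_flags is a hypothetical name, and the caller passes every flag explicitly because the defaults are not stated here.

```python
def effective_flags(fmt: bool, punctuate: bool, capitalize: bool,
                    itn_normalize: bool) -> dict:
    """Resolve the formatting flags the way the format switch describes."""
    if not fmt:
        # format=false wins over everything, including ITN.
        return {"punctuate": False, "capitalize": False,
                "itn_normalize": False}
    # format=true: each flag takes effect independently.
    return {"punctuate": punctuate, "capitalize": capitalize,
            "itn_normalize": itn_normalize}
```

So a request with format=false and punctuate=true still resolves to punctuate off, while format=true with capitalize=false only changes casing.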
Redact personally identifiable information from the transcript.
Names → [FIRSTNAME_*] / [LASTNAME_*], phone numbers →
[PHONENUMBER_*], addresses → [ADDRESS_*], etc. The redaction
tokens use sequential indices so multiple occurrences of the same
entity get distinct labels ([FIRSTNAME_1], [FIRSTNAME_2]).
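Downstream code sometimes needs to know which entities were redacted. A short sketch of pulling the tokens back out of a transcript, assuming every token has the shape [NAME_N] with an uppercase entity name and a numeric index, as in the examples above; redaction_index is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches tokens such as [FIRSTNAME_1] or [PHONENUMBER_2].
REDACTION_TOKEN = re.compile(r"\[([A-Z]+)_(\d+)\]")


def redaction_index(transcript: str) -> dict:
    """Map each entity type to the indices that appear in the transcript."""
    found = defaultdict(list)
    for entity, index in REDACTION_TOKEN.findall(transcript):
        found[entity].append(int(index))
    return dict(found)
```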
Boost recognition of specific words or phrases for this session. Useful for product names, jargon, proper nouns, and other domain-specific terms the model might otherwise mis-transcribe.
Format: a single comma-separated string (not a JSON array).
Each entry is a word or phrase, optionally followed by
:INTENSIFIER — a numeric boost multiplier. Defaults to 1.0
when omitted.
Example: I:20,smiling:26
Multi-word phrases are allowed (e.g. small language model:3.5).
Decimal values are allowed (e.g. 2, 2.5, 0.5).
INTENSIFIER range: 0 to 20. Recommended value is 6.
Higher values bias recognition more aggressively toward the
keyword, but also increase the risk of hallucination and
repetition in the transcript. Values of 10 or above are
not recommended — the model may insert the keyword even when
it was not spoken. Start around 3–6 and tune from there.
Wire format: pass as a query-string parameter, URL-encoded.
URLSearchParams in JavaScript and urlencode() in Python
handle the encoding for you — pass the raw string, not a
JSON-encoded array.
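A sketch of building and encoding the string in Python. The query parameter name keywords is an assumption (the name is not given above), build_keywords is a hypothetical helper, and rejecting commas and colons inside a phrase is a choice made for this sketch because they would be ambiguous in the comma-separated format.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_keywords(boosts: Dict[str, Optional[float]]) -> str:
    """Join phrase[:INTENSIFIER] entries into one comma-separated string."""
    entries = []
    for phrase, intensifier in boosts.items():
        if "," in phrase or ":" in phrase:
            raise ValueError(f"ambiguous separator in phrase: {phrase!r}")
        if intensifier is None:
            entries.append(phrase)  # server default applies
            continue
        if not 0 <= intensifier <= 20:
            raise ValueError(f"INTENSIFIER out of range: {intensifier}")
        entries.append(f"{phrase}:{intensifier:g}")
    return ",".join(entries)


raw = build_keywords({"small language model": 3.5, "Pulse": None})
# "keywords" is an assumed parameter name; urlencode does the escaping.
query = urlencode({"model": "pulse", "keywords": raw})
```

The raw string stays human-readable; only the final urlencode call escapes the spaces, colons, and commas.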