Transcribe (Realtime / WebSocket)

Real-time speech-to-text over a persistent WebSocket. Use it for live captioning, voice agents, and any flow that needs partial transcripts while the user is still speaking.

When to use this

- Use this for live audio: microphone input, voice-agent turns, simultaneous interpretation, low-latency captioning. Partial results stream back while audio is still arriving.
- Use POST /waves/v1/stt/ when you have a complete file. Single request, single response, less plumbing.

Model selection

Today only ?model=pulse is supported on the streaming endpoint. ?model=pulse-pro is rejected with 400 before WebSocket upgrade because Pulse Pro has no streaming worker; use POST /waves/v1/stt/?model=pulse-pro (HTTP) instead.
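
The routing rule above can be sketched as a small helper. This is a minimal sketch, not part of the API: the function name endpoint_for is hypothetical, and the https:// host used for the HTTP endpoint is an assumption (this page only gives the path POST /waves/v1/stt/).

```python
from urllib.parse import urlencode

# Per this page, only pulse is accepted on the streaming endpoint today.
STREAMING_MODELS = {"pulse"}


def endpoint_for(model: str, live: bool) -> str:
    """Return the URL to call for a model (hypothetical helper)."""
    query = urlencode({"model": model})
    if live:
        if model not in STREAMING_MODELS:
            # The server would answer 400 before the WebSocket upgrade,
            # so fail early on the client instead.
            raise ValueError(f"{model!r} is not available for streaming; use the HTTP endpoint")
        return f"wss://api.smallest.ai/waves/v1/stt/live?{query}"
    # Assumption: the HTTP endpoint is served from the same host.
    return f"https://api.smallest.ai/waves/v1/stt/?{query}"
```

Calling endpoint_for("pulse-pro", live=True) raises locally, which mirrors the 400 the server returns for that combination.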

How it works

1. Open a WebSocket to wss://api.smallest.ai/waves/v1/stt/live with Authorization: Bearer <key> and the session params (model, language, sample_rate, encoding, etc.) in the query string.
2. Stream raw PCM (or your chosen encoding) over the socket as binary frames.
3. The server pushes back JSON transcription messages: partial results carry is_final: false while audio streams, and is_final: true marks the result that closes an utterance.
4. Send a control message when the user pauses or the session ends:
   - {"type":"finalize"}: turn-boundary signal. Flushes the current audio buffer, emits one is_final: true transcript for that turn, and keeps the WebSocket open for the next user turn. Use this once per turn in a multi-turn voice agent.
   - {"type":"close_stream"}: session-end signal. Flushes remaining audio, emits the terminal is_final: true + is_last: true transcript, then closes the WebSocket. Use this once, at the actual end of the session (call end, app shutdown, or after a single-shot transcription buffer is fully streamed).

A multi-turn voice agent typically fires many finalize messages and exactly one close_stream. A one-off transcription of a fixed audio buffer fires only close_stream.
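
The message flow described above can be sketched as two constants and a classifier. This is a minimal sketch that assumes server messages are JSON objects carrying the is_final and is_last flags named on this page; the name classify is hypothetical.

```python
import json

# Control messages, serialized once and reused.
FINALIZE = json.dumps({"type": "finalize"})          # once per user turn
CLOSE_STREAM = json.dumps({"type": "close_stream"})  # once per session


def classify(raw: str) -> str:
    """Label a server message as 'partial', 'final', or 'last'."""
    data = json.loads(raw)
    if data.get("is_last"):
        return "last"      # terminal transcript, sent after close_stream
    if data.get("is_final"):
        return "final"     # an utterance or turn has closed
    return "partial"       # interim result while audio is still arriving
```

Checking is_last before is_final matters because the terminal message sets both flags.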

Examples

Python — multi-turn voice agent (recommended for Voice AI)

Send finalize per user turn so the WebSocket stays open across the whole call — you pay the connection cost once, not per turn:

```python
import asyncio
import json
import os

import websockets

# API_KEY must exist before HEADERS is built. The environment variable
# name is only an example.
API_KEY = os.environ["SMALLEST_API_KEY"]

URL = "wss://api.smallest.ai/waves/v1/stt/live?model=pulse&language=en&sample_rate=16000&encoding=linear16&itn_normalize=true&finalize_on_words=false&eou_timeout_ms=1000"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

async def run_voice_agent(audio_source, llm_reply, stop_event):
    # Recent websockets releases take additional_headers; older ones used extra_headers.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        async def stream_audio():
            # Forward binary audio frames until the call ends.
            async for frame in audio_source:
                if stop_event.is_set():
                    return
                await ws.send(frame)

        # Call this when your VAD detects end-of-turn (user paused)
        async def end_of_turn():
            await ws.send(json.dumps({"type": "finalize"}))

        async def consume():
            async for msg in ws:
                data = json.loads(msg)
                if data.get("is_last"):
                    break  # only fires after close_stream
                if data.get("is_final"):
                    await llm_reply(data["transcript"])  # ITN-normalized full turn

        producer = asyncio.create_task(stream_audio())
        consumer = asyncio.create_task(consume())
        await stop_event.wait()  # end of call

        await ws.send(json.dumps({"type": "close_stream"}))
        await consumer
        producer.cancel()
```

Python — single-shot transcription

For one-off transcription of a complete audio buffer (file, single utterance) where no further audio is coming, send close_stream directly after the last chunk:

```python
import asyncio
import json
import os

import websockets

# The environment variable name is only an example.
API_KEY = os.environ["SMALLEST_API_KEY"]

URL = "wss://api.smallest.ai/waves/v1/stt/live?model=pulse&language=en&sample_rate=16000&encoding=linear16"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

async def transcribe_once(audio_bytes):
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Send the buffer as 4096-byte binary frames.
        for i in range(0, len(audio_bytes), 4096):
            await ws.send(audio_bytes[i:i + 4096])
        await ws.send(json.dumps({"type": "close_stream"}))
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                print(data["transcript"])
            if data.get("is_last"):
                break

if __name__ == "__main__":
    with open("audio.pcm", "rb") as f:
        audio = f.read()
    asyncio.run(transcribe_once(audio))
```

Common gotchas

- model is required. Missing or invalid values return 400 before the WebSocket upgrades.
- Match sample_rate to your audio. The server does not resample; mismatched rates produce garbage transcripts.
- Existing clients on wss://api.smallest.ai/waves/v1/pulse/get_text continue to work alongside this unified path.
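
One way to avoid a sample_rate mismatch is to read the rate from the audio itself instead of hard-coding it. The sketch below assumes the source is a 16-bit PCM WAV file and uses Python's standard wave module; the helper name live_url_for_wav is hypothetical, and channel handling is not covered here.

```python
import io
import wave
from urllib.parse import urlencode


def live_url_for_wav(wav_bytes: bytes, language: str = "en") -> str:
    """Build the streaming URL with sample_rate taken from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            # linear16 means 2 bytes per sample.
            raise ValueError("expected 16-bit PCM for encoding=linear16")
        rate = wav.getframerate()
    params = {
        "model": "pulse",
        "language": language,
        "sample_rate": rate,
        "encoding": "linear16",
    }
    return "wss://api.smallest.ai/waves/v1/stt/live?" + urlencode(params)
```

When streaming a WAV file as raw PCM, send the bytes returned by readframes() rather than the whole file, since readframes() excludes the file header.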

Handshake

WSS

wss://api.smallest.ai/waves/v1/stt/live

Authentication

Authorization: Bearer

Header authentication of the form Bearer <token>

Query parameters

model (any, required)

Selects the ASR model. Today only pulse is supported on the streaming endpoint. pulse-pro returns 400 before the WebSocket upgrade because Pulse Pro has no streaming worker yet; for Pulse Pro use the HTTP endpoint (POST /waves/v1/stt/?model=pulse-pro).

language (string, required)

Language code for transcription. Set explicitly to the known language for best accuracy. See the Pulse model card for the full 38-language list and the multi-* aggregators.

sample_rate (any, optional, defaults to 16000)

Audio sample rate in Hz of the bytes you stream. Must match the actual rate of your audio source.

encoding (any, optional, defaults to linear16)

Audio encoding of the bytes you stream. The server uses this to decode incoming frames; set it to match what your client is sending.

- linear16, linear32: raw PCM (16-bit and 32-bit). Pair with the matching sample_rate.
- alaw, mulaw: 8 kHz telephony codecs. Pair with sample_rate=8000.
- opus, ogg_opus: Opus compressed audio (raw and Ogg container).
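
A client can check these pairings before opening the socket. This is a minimal sketch built only from the values listed above; check_audio_params is a hypothetical name, and the server may enforce additional rules not documented here.

```python
PCM = {"linear16", "linear32"}
TELEPHONY = {"alaw", "mulaw"}
OPUS = {"opus", "ogg_opus"}


def check_audio_params(encoding: str = "linear16", sample_rate: int = 16000) -> None:
    """Raise ValueError for combinations this page rules out."""
    if encoding not in PCM | TELEPHONY | OPUS:
        raise ValueError(f"unknown encoding: {encoding!r}")
    if encoding in TELEPHONY and sample_rate != 8000:
        # alaw and mulaw are described as 8 kHz telephony codecs.
        raise ValueError(f"{encoding} should be paired with sample_rate=8000")
```

The defaults mirror the documented defaults (linear16 at 16000 Hz), so check_audio_params() with no arguments passes.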

word_timestamps (any, optional, defaults to false)

Include word-level timestamps in transcription events.

diarize (any, optional, defaults to false)

Enable speaker diarization to identify different speakers in the audio.

eou_timeout_ms (string, optional, defaults to 800)

Time in milliseconds to wait after speech ends before flushing the transcript.

format (any, optional, defaults to true)

Master formatting switch for transcript responses. When false, forces punctuate=false, capitalize=false, and also disables Inverse Text Normalization (ITN) so it cannot silently reintroduce punctuation or casing.

When true, the punctuate and capitalize params take effect independently. Leave format=true and use those two to fine-tune.

punctuate (any, optional, defaults to true)

When false, strips punctuation (., ,, ?, !) from the final transcript, words[].word, and utterances[].transcript. Does not affect casing; use capitalize for that.

Overridden to false when format=false.

capitalize (any, optional, defaults to true)

When false, lowercases the entire transcript output (final transcript, words[].word, and utterances[].transcript). Does not affect punctuation — use punctuate for that.

Overridden to false when format=false.

itn_normalize (any, optional, defaults to false)

Enable Inverse Text Normalization to convert spoken-form entities (numbers, dates, currencies, phone numbers, etc.) into written form in finalized transcripts. Example: five five five one two three four becomes 5551234.

Overridden to false when format=false.
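
The override rules for format, punctuate, capitalize, and itn_normalize can be summarized in a few lines. This sketch models only the behavior described on this page and uses the documented defaults; effective_formatting is a hypothetical client-side helper, and format_ is spelled with a trailing underscore to avoid shadowing Python's built-in format.

```python
def effective_formatting(
    format_: bool = True,
    punctuate: bool = True,
    capitalize: bool = True,
    itn_normalize: bool = False,
) -> dict:
    """Return the settings that actually apply after the format override."""
    if not format_:
        # format=false forces all three off, whatever was requested.
        return {"punctuate": False, "capitalize": False, "itn_normalize": False}
    return {
        "punctuate": punctuate,
        "capitalize": capitalize,
        "itn_normalize": itn_normalize,
    }
```

For example, requesting itn_normalize=true together with format=false still yields ITN disabled.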

finalize_on_words (any, optional, defaults to true)

When false, disables automatic word-count-based finalization. Use with itn_normalize=true for agentic pipelines where you control finalization explicitly via {"type":"finalize"} messages.

max_words (string, optional)

Maximum number of words before forced finalization. Useful for keeping ITN chunks short and accurate, or for downstream LLM pipelines that want bounded segment lengths.

redact_pii (any, optional, defaults to false)

Redact personally identifiable information from the transcript. Names → [FIRSTNAME_*] / [LASTNAME_*], phone numbers → [PHONENUMBER_*], addresses → [ADDRESS_*], etc. The redaction tokens use sequential indices so multiple occurrences of the same entity get distinct labels ([FIRSTNAME_1], [FIRSTNAME_2]).
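
Downstream code may want to know what was redacted in a transcript. The sketch below assumes every token has the shape [UPPERCASE_N] shown in the examples above; the full set of token names is not listed on this page, and redaction_counts is a hypothetical helper.

```python
import re
from collections import Counter

# Matches tokens such as [FIRSTNAME_1] or [PHONENUMBER_2].
TOKEN = re.compile(r"\[([A-Z]+)_(\d+)\]")


def redaction_counts(transcript: str) -> Counter:
    """Count distinct redaction tokens per entity type."""
    distinct = set(TOKEN.findall(transcript))  # (type, index) pairs
    return Counter(kind for kind, _ in distinct)
```

A token that appears twice with the same index is counted once, since it is the same label.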

redact_pci (any, optional, defaults to false)

Redact payment card information (credit-card numbers, CVV, account numbers, etc.). Replaces matches with [ACCOUNTNUMBER_*] tokens. Use alongside redact_pii=true for full PCI-compliant transcript handling in voice-agent flows.

sentence_timestamps (any, optional, defaults to false)

Include sentence-level timing in the transcription event under a new utterances array. Each entry carries text, start, end, and (when diarization is on) speaker. Combine with word_timestamps=true to get both word and sentence boundaries.

full_transcript (any, optional, defaults to false)

When true, every transcription event also carries a cumulative full_transcript field containing the concatenated final transcripts of all utterances so far in this session. Useful for multi-turn voice agents that want the running transcript without tracking turns client-side.
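
For comparison, this is roughly the client-side bookkeeping that full_transcript=true saves you. It is a minimal sketch: RunningTranscript is a hypothetical class, and joining turns with a single space is an assumption, since this page does not say how the server concatenates them.

```python
class RunningTranscript:
    """Accumulate final transcripts across turns on the client."""

    def __init__(self) -> None:
        self.turns: list[str] = []

    def feed(self, message: dict) -> str:
        """Record a parsed server message and return the transcript so far."""
        if message.get("is_final") and message.get("transcript"):
            self.turns.append(message["transcript"])
        # Assumption: turns are separated by one space.
        return " ".join(self.turns)
```

Partial results are ignored, so the running text only grows when a turn is finalized.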

keywords (string, optional)

Boost recognition of specific words or phrases for this session. Useful for product names, jargon, proper nouns, and other domain-specific terms the model might otherwise mis-transcribe.

Format: a single comma-separated string (not a JSON array). Each entry is a word or phrase, optionally followed by :INTENSIFIER, a numeric boost multiplier that defaults to 1.0 when omitted.

Example: smiling:6,small language model:3.5

- Phrases can include spaces (small language model:3.5).
- Intensifier accepts integers or decimals (2, 2.5, 0.5).
- Mixing entries with and without intensifiers is fine.
- Maximum 100 keywords per session.

INTENSIFIER range: 0 to 20. Recommended value is 6. Higher values bias recognition more aggressively toward the keyword, but also increase the risk of hallucination and repetition in the transcript. Values of 10 or above are not recommended because the model may insert the keyword even when it was not spoken. Start around 3–6 and tune from there.

Wire format: pass as a query-string parameter, URL-encoded. URLSearchParams in JavaScript and urlencode() in Python handle the encoding for you, so pass the raw string, not a JSON-encoded array.
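
Building the keywords string by hand is error-prone, so a small builder may help. This sketch enforces only the limits stated above (0 to 20 range, at most 100 entries); build_keywords is a hypothetical helper, and rejecting commas and colons inside a phrase is an assumption, because this page does not describe any escaping for them.

```python
from urllib.parse import urlencode


def build_keywords(entries) -> str:
    """Join entries into the comma-separated keywords string.

    Each entry is either a phrase or a (phrase, intensifier) pair.
    """
    parts = []
    for entry in entries:
        phrase, boost = (entry, None) if isinstance(entry, str) else entry
        if "," in phrase or ":" in phrase:
            # Assumption: the separators cannot appear inside a phrase.
            raise ValueError(f"phrase contains a separator: {phrase!r}")
        if boost is not None and not 0 <= boost <= 20:
            raise ValueError(f"intensifier out of range 0 to 20: {boost}")
        parts.append(phrase if boost is None else f"{phrase}:{boost}")
    if len(parts) > 100:
        raise ValueError("at most 100 keywords per session")
    return ",".join(parts)


# urlencode() percent-encodes the raw string for the query string.
query = urlencode({"keywords": build_keywords([("small language model", 3.5), ("smiling", 6)])})
```

The builder returns the raw string; encoding is left to urlencode() so the value is never encoded twice.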

Send

sendAudio (string, required, format: binary)

sendFinalize (object, required)

Flush the current audio buffer, run ITN over the accumulated utterance, and emit one is_final transcript. The WebSocket stays open and accepts audio for the next user turn. Send this once per turn in any multi-turn flow (voice agents, conversational STT).

sendClose (object, required)

Flush any remaining buffered audio, emit the terminal is_final + is_last transcript, then close the WebSocket. Send this once at the end of the session — end of call, app shutdown, or after the entire buffer of a single-shot transcription is streamed.

Receive

receiveTranscription (object, required)

URL	wss://api.smallest.ai/waves/v1/stt/live
Method	GET
Status	101 Switching Protocols
