> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Transcribe (Realtime / WebSocket)

GET /waves/v1/stt/live

Real-time speech-to-text over a persistent WebSocket. Use it for live captioning, voice agents, and any flow that needs partial transcripts while the user is still speaking.

## When to use this

- **Use this** for live audio: microphone input, voice-agent turns, simultaneous interpretation, low-latency captioning. Partial results stream back while audio is still arriving.
- **Use `POST /waves/v1/stt/`** when you have a complete file. Single request, single response, less plumbing.

## Model selection

The streaming endpoint currently supports only `?model=pulse`. `?model=pulse-pro` is rejected with `400` before the WebSocket upgrade because Pulse Pro has no streaming worker; use `POST /waves/v1/stt/?model=pulse-pro` (HTTP) instead.
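
A client can apply the same rule locally and skip a round trip that would end in `400`. The sketch below is illustrative only: `endpoint_for` and the two model sets are hypothetical names, and the sets simply restate what this page says today, so they would need updating if the supported models change.

```python
# Assumption: these sets mirror the "Model selection" paragraph above.
STREAMING_MODELS = {"pulse"}
HTTP_ONLY_MODELS = {"pulse-pro"}

def endpoint_for(model: str) -> tuple[str, str]:
    """Return (transport, path) for a model, or raise for unknown names."""
    if model in STREAMING_MODELS:
        return ("websocket", f"/waves/v1/stt/live?model={model}")
    if model in HTTP_ONLY_MODELS:
        # No streaming worker: fall back to the pre-recorded HTTP endpoint.
        return ("http", f"/waves/v1/stt/?model={model}")
    raise ValueError(f"unknown model: {model!r}")
```

Returning the transport alongside the path lets the caller choose between a WebSocket client and a plain HTTP `POST` without parsing the path.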

## How it works

1. Open a WebSocket to `wss://api.smallest.ai/waves/v1/stt/live` with `Authorization: Bearer <key>` and the session params (`model`, `language`, `sample_rate`, `encoding`, etc.) as query-string parameters.
2. Stream raw PCM (or your chosen `encoding`) over the socket as binary frames.
3. The server pushes back JSON `transcription` messages: partial results with `is_final: false` while audio is still streaming, then a message with `is_final: true` when an utterance closes.
4. Send a control message when the user pauses or the session ends:
   - **`{"type":"finalize"}`** — *turn-boundary signal*. Flushes the current audio buffer, emits one `is_final: true` transcript for that turn, and **keeps the WebSocket open** for the next user turn. Use this once per turn in a multi-turn voice agent.
   - **`{"type":"close_stream"}`** — *session-end signal*. Flushes remaining audio, emits the terminal `is_final: true` + `is_last: true` transcript, then closes the WebSocket. Use this once, at the actual end of the session (call end, app shutdown, or after a single-shot transcription buffer is fully streamed).

A multi-turn voice agent typically fires many `finalize` messages and exactly one `close_stream`. A one-off transcription of a fixed audio buffer fires only `close_stream`.
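
The connection URL and the two control messages from the steps above can be built with the standard library alone. This is a minimal sketch, not an official client: `build_live_url`, `FINALIZE`, and `CLOSE_STREAM` are hypothetical names, and the defaults are taken from the examples on this page.

```python
import json
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/stt/live"

def build_live_url(model="pulse", language="en", sample_rate=16000,
                   encoding="linear16", **extra):
    """Assemble the WebSocket URL with session params as a query string."""
    params = {"model": model, "language": language,
              "sample_rate": sample_rate, "encoding": encoding}
    for key, value in extra.items():
        # Booleans are sent as lowercase strings, as in the examples on this page.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}?{urlencode(params)}"

# Control messages are JSON text frames; audio goes out as binary frames.
FINALIZE = json.dumps({"type": "finalize"})          # end of a user turn
CLOSE_STREAM = json.dumps({"type": "close_stream"})  # end of the session
```

For example, `build_live_url(itn_normalize=True, eou_timeout_ms=1000)` appends `&itn_normalize=true&eou_timeout_ms=1000` to the four default parameters.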

## Examples

**Python — multi-turn voice agent (recommended for Voice AI)**

Send `finalize` per user turn so the WebSocket stays open across the whole call — you pay the connection cost once, not per turn:

```python
import asyncio, json, os
import websockets

# Assumption: the API key is read from an environment variable; the name is illustrative.
API_KEY = os.environ["SMALLEST_API_KEY"]
URL = "wss://api.smallest.ai/waves/v1/stt/live?model=pulse&language=en&sample_rate=16000&encoding=linear16&itn_normalize=true&finalize_on_words=false&eou_timeout_ms=1000"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

async def run_voice_agent(audio_source, llm_reply, stop_event, turn_ended):
    # `turn_ended` is an asyncio.Event that your VAD sets at end-of-turn (user paused).
    # With websockets 14 or newer, connect() takes `additional_headers`;
    # earlier releases use `extra_headers`.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        async def stream_audio():
            async for frame in audio_source:
                if stop_event.is_set():
                    return
                await ws.send(frame)                     # binary audio frame

        async def finalize_turns():
            while True:
                await turn_ended.wait()
                turn_ended.clear()
                await ws.send(json.dumps({"type": "finalize"}))

        async def consume():
            async for msg in ws:
                data = json.loads(msg)
                if data.get("is_last"):                  # only fires after close_stream
                    break
                if data.get("is_final"):
                    await llm_reply(data["transcript"])  # ITN-normalized full turn

        producer = asyncio.create_task(stream_audio())
        turns = asyncio.create_task(finalize_turns())
        consumer = asyncio.create_task(consume())
        await stop_event.wait()                          # end of call

        producer.cancel()                                # stop sending audio and finalize signals
        turns.cancel()
        await ws.send(json.dumps({"type": "close_stream"}))
        await consumer
```
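
The consumer in the example above reacts only to `is_final` and `is_last`. A fuller client usually also needs to tell partial results and error events apart. The sketch below routes a raw server message using only fields from the AsyncAPI schema on this page (`type`, `is_final`, `is_last`, `from_finalize`); the function name and the returned labels are hypothetical.

```python
import json

def classify(raw: str) -> str:
    """Label a server message so the caller can route it."""
    msg = json.loads(raw)
    if msg.get("type") == "error":
        return "error"        # error event: inspect msg["message"]
    if msg.get("is_last"):
        return "last"         # terminal segment, sent after close_stream
    if msg.get("is_final"):
        # from_finalize separates client-sent finalize from the server's own finalizer.
        return "turn_final" if msg.get("from_finalize") else "auto_final"
    return "partial"          # may still be revised
```

Checking `is_last` before `is_final` matters because the terminal message carries both flags.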

**Python — single-shot transcription**

For one-off transcription of a complete audio buffer (file, single utterance) where no further audio is coming, send `close_stream` directly after the last chunk:

```python
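import asyncio, json, os
import websockets

# Assumption: the API key is read from an environment variable; the name is illustrative.
API_KEY = os.environ["SMALLEST_API_KEY"]
URL = "wss://api.smallest.ai/waves/v1/stt/live?model=pulse&language=en&sample_rate=16000&encoding=linear16"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

async def transcribe_once(audio_bytes):
    # With websockets 14 or newer, connect() takes `additional_headers`;
    # earlier releases use `extra_headers`.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Stream the buffer as 4096-byte binary frames.
        for i in range(0, len(audio_bytes), 4096):
            await ws.send(audio_bytes[i:i+4096])
        # No more audio is coming: flush and end the session.
        await ws.send(json.dumps({"type": "close_stream"}))
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                print(data["transcript"])
            if data.get("is_last"):
                break

# audio.pcm must be raw PCM matching the sample_rate and encoding in URL.
with open("audio.pcm", "rb") as f:
    asyncio.run(transcribe_once(f.read()))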
```
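
The 4096-byte chunk size above is arbitrary. To reason about how much audio each frame carries, the arithmetic can be sketched as below, assuming `linear16` means 16-bit samples (2 bytes each) and mono audio; the channel count is an assumption, since this page does not state it, and both helper names are hypothetical.

```python
BYTES_PER_SAMPLE = 2  # linear16: 16-bit PCM

def chunk_duration_ms(n_bytes: int, sample_rate: int = 16000, channels: int = 1) -> float:
    """Milliseconds of audio contained in n_bytes of PCM."""
    samples = n_bytes // (BYTES_PER_SAMPLE * channels)
    return samples * 1000 / sample_rate

def bytes_for_ms(ms: int, sample_rate: int = 16000, channels: int = 1) -> int:
    """Bytes needed to hold `ms` milliseconds of PCM."""
    return sample_rate * ms // 1000 * BYTES_PER_SAMPLE * channels
```

Under these assumptions a 4096-byte frame at 16 kHz holds 128 ms of audio, and a 20 ms frame is 640 bytes.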

## Common gotchas

- **`model` is required.** Missing or invalid values return `400` before the WebSocket upgrades.
- **Match `sample_rate` to your audio.** The server does not resample; mismatched rates produce garbage transcripts.
- **Existing clients** on `wss://api.smallest.ai/waves/v1/pulse/get_text` continue to work alongside this unified path.
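
Because the server does not resample, it can help to verify the rate before connecting. When the source is a WAV file, the standard-library `wave` module reads the rate from the header. This is a sketch under that assumption; `check_sample_rate` is a hypothetical helper, and raw PCM files carry no header, so this check does not apply to them.

```python
import io
import wave

def check_sample_rate(wav_bytes: bytes, declared_rate: int) -> int:
    """Raise if the WAV header's rate differs from the `sample_rate` query param."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        actual = wav.getframerate()
    if actual != declared_rate:
        raise ValueError(
            f"audio is {actual} Hz but sample_rate={declared_rate} was declared"
        )
    return actual
```

On a mismatch, either resample the audio locally or connect with the file's actual rate.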


Reference: https://docs.smallest.ai/waves/api-reference/api-reference/speech-to-text/speech-to-text

## AsyncAPI Specification

```yaml
asyncapi: 2.6.0
info:
  title: Speech to Text
  version: subpackage_speechToText.Speech to Text
  description: >
    Real-time speech-to-text over a persistent WebSocket. The fit-for-purpose
    path for live captioning, voice agents, and any flow where you need partial
    transcripts as the user is still speaking.


channels:
  /waves/v1/stt/live:
    description: >
      Real-time speech-to-text over a persistent WebSocket. The fit-for-purpose
      path for live captioning, voice agents, and any flow where you need
      partial transcripts as the user is still speaking.


    bindings:
      ws:
        headers:
          type: object
          properties:
            model:
              description: Reference to transcription_model
            language:
              type: string
            sample_rate:
              description: Reference to transcription_sample_rate
              default: '16000'
            encoding:
              description: Reference to transcription_encoding
              default: linear16
            word_timestamps:
              description: Reference to transcription_word_timestamps
              default: 'false'
            diarize:
              description: Reference to transcription_diarize
              default: 'false'
            eou_timeout_ms:
              type: string
              default: '800'
            format:
              description: Reference to transcription_format
              default: 'true'
            punctuate:
              description: Reference to transcription_punctuate
              default: 'true'
            capitalize:
              description: Reference to transcription_capitalize
              default: 'true'
            itn_normalize:
              description: Reference to transcription_itn_normalize
              default: 'false'
            finalize_on_words:
              description: Reference to transcription_finalize_on_words
              default: 'true'
            max_words:
              type: string
            redact_pii:
              description: Reference to transcription_redact_pii
              default: 'false'
            redact_pci:
              description: Reference to transcription_redact_pci
              default: 'false'
            sentence_timestamps:
              description: Reference to transcription_sentence_timestamps
              default: 'false'
            full_transcript:
              description: Reference to transcription_full_transcript
              default: 'false'
            keywords:
              type: string
    publish:
      operationId: speech-to-text-publish
      summary: Server messages
      message:
        oneOf:
          - $ref: >-
              #/components/messages/subpackage_speechToText.Speech to
              Text-server-0-receiveTranscription
          - $ref: >-
              #/components/messages/subpackage_speechToText.Speech to
              Text-server-1-receiveTranscription
    subscribe:
      operationId: speech-to-text-subscribe
      summary: Client messages
      message:
        oneOf:
          - $ref: >-
              #/components/messages/subpackage_speechToText.Speech to
              Text-client-0-sendAudio
          - $ref: >-
              #/components/messages/subpackage_speechToText.Speech to
              Text-client-1-sendFinalize
          - $ref: >-
              #/components/messages/subpackage_speechToText.Speech to
              Text-client-2-sendClose
servers:
  waves:
    url: wss://api.smallest.ai/
    protocol: wss
components:
  messages:
    subpackage_speechToText.Speech to Text-server-0-receiveTranscription:
      name: receiveTranscription
      title: receiveTranscription
      payload:
        $ref: '#/components/schemas/transcription_transcriptionEvent'
    subpackage_speechToText.Speech to Text-server-1-receiveTranscription:
      name: receiveTranscription
      title: receiveTranscription
      payload:
        $ref: '#/components/schemas/transcription_errorEvent'
    subpackage_speechToText.Speech to Text-client-0-sendAudio:
      name: sendAudio
      title: sendAudio
      payload:
        $ref: '#/components/schemas/transcription_audioChunkOut'
    subpackage_speechToText.Speech to Text-client-1-sendFinalize:
      name: sendFinalize
      title: sendFinalize
      description: >-
        Flush the current audio buffer, run ITN over the accumulated utterance,
        and emit one `is_final` transcript. The WebSocket stays open and accepts
        audio for the next user turn. Send this once per turn in any multi-turn
        flow (voice agents, conversational STT).
      payload:
        $ref: '#/components/schemas/transcription_finalizeSignal'
    subpackage_speechToText.Speech to Text-client-2-sendClose:
      name: sendClose
      title: sendClose
      description: >-
        Flush any remaining buffered audio, emit the terminal `is_final` +
        `is_last` transcript, then close the WebSocket. Send this once at the
        end of the session — end of call, app shutdown, or after the entire
        buffer of a single-shot transcription is streamed.
      payload:
        $ref: '#/components/schemas/transcription_closeStream'
  schemas:
    ChannelsTranscriptionMessagesTranscriptionEventType:
      type: string
      enum:
        - transcription
      title: ChannelsTranscriptionMessagesTranscriptionEventType
    ChannelsTranscriptionMessagesTranscriptionEventWordsItems:
      type: object
      properties:
        word:
          type: string
        start:
          type: number
          format: double
        end:
          type: number
          format: double
        speaker:
          type: string
          description: Present when `diarize=true`.
      title: ChannelsTranscriptionMessagesTranscriptionEventWordsItems
    ChannelsTranscriptionMessagesTranscriptionEventUtterancesItems:
      type: object
      properties:
        text:
          type: string
          description: The sentence text.
        start:
          type: number
          format: double
          description: Start time in seconds.
        end:
          type: number
          format: double
          description: End time in seconds.
        speaker:
          type: string
          description: Speaker label. Present when `diarize=true`.
      title: ChannelsTranscriptionMessagesTranscriptionEventUtterancesItems
    transcription_transcriptionEvent:
      type: object
      properties:
        type:
          $ref: >-
            #/components/schemas/ChannelsTranscriptionMessagesTranscriptionEventType
        status:
          type: string
        transcription:
          type: string
        transcript:
          type: string
          description: >-
            Same content as `transcription`; the WebSocket DTO emits both fields
            for SDK compatibility.
        full_transcript:
          type: string
          description: >-
            Cumulative transcript across all utterances in this session, when
            `full_transcript=true` is set on connect.
        is_final:
          type: boolean
          description: True when this segment will not be revised.
        is_last:
          type: boolean
          description: >-
            True only on the terminal segment, which is sent after the client
            sends `{"type":"close_stream"}`.
        from_finalize:
          type: boolean
          description: >-
            True when this `is_final` was produced by a client-sent
            `{"type":"finalize"}` rather than the server's automatic finalizer.
            Useful in multi-turn flows for per-turn latency measurement.
        session_id:
          type: string
        language:
          type:
            - string
            - 'null'
          description: The language code Pulse detected or was pinned to for this segment.
        languages:
          type:
            - array
            - 'null'
          items:
            type: string
          description: Populated when language detection emits multiple candidates.
        words:
          type: array
          items:
            $ref: >-
              #/components/schemas/ChannelsTranscriptionMessagesTranscriptionEventWordsItems
        utterances:
          type: array
          items:
            $ref: >-
              #/components/schemas/ChannelsTranscriptionMessagesTranscriptionEventUtterancesItems
          description: >-
            Sentence-level timestamps. Present when `sentence_timestamps=true`
            is set on connect.
      title: transcription_transcriptionEvent
    ChannelsTranscriptionMessagesErrorEventType:
      type: string
      enum:
        - error
      title: ChannelsTranscriptionMessagesErrorEventType
    transcription_errorEvent:
      type: object
      properties:
        type:
          $ref: '#/components/schemas/ChannelsTranscriptionMessagesErrorEventType'
        status:
          type: string
        message:
          type: string
      title: transcription_errorEvent
    transcription_audioChunkOut:
      type: string
      format: binary
      title: transcription_audioChunkOut
    ChannelsTranscriptionMessagesFinalizeSignalType:
      type: string
      enum:
        - finalize
      description: >-
        Flush the current audio buffer and emit one `is_final: true` transcript.
        The WebSocket stays open and accepts audio for the next user turn.
      title: ChannelsTranscriptionMessagesFinalizeSignalType
    transcription_finalizeSignal:
      type: object
      properties:
        type:
          $ref: '#/components/schemas/ChannelsTranscriptionMessagesFinalizeSignalType'
          description: >-
            Flush the current audio buffer and emit one `is_final: true`
            transcript. The WebSocket stays open and accepts audio for the next
            user turn.
      required:
        - type
      title: transcription_finalizeSignal
    ChannelsTranscriptionMessagesCloseStreamType:
      type: string
      enum:
        - close_stream
      description: >-
        Flush remaining buffered audio, emit the terminal `is_final` + `is_last`
        transcript, then close the WebSocket. Send exactly once at the end of
        the session.
      title: ChannelsTranscriptionMessagesCloseStreamType
    transcription_closeStream:
      type: object
      properties:
        type:
          $ref: '#/components/schemas/ChannelsTranscriptionMessagesCloseStreamType'
          description: >-
            Flush remaining buffered audio, emit the terminal `is_final` +
            `is_last` transcript, then close the WebSocket. Send exactly once at
            the end of the session.
      required:
        - type
      title: transcription_closeStream

```
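
As an illustration of consuming the `words` array from `transcription_transcriptionEvent`, the sketch below groups word text by speaker label. It assumes the connection used `diarize=true`, so each word may carry a `speaker` field; the function name and the sample event are made up for illustration.

```python
from collections import defaultdict

def text_by_speaker(event: dict) -> dict[str, str]:
    """Join the words of one transcription event per speaker label."""
    grouped = defaultdict(list)
    for item in event.get("words") or []:
        # `speaker` is present only when diarize=true; fall back to a placeholder.
        grouped[item.get("speaker", "unknown")].append(item["word"])
    return {speaker: " ".join(words) for speaker, words in grouped.items()}

# Hypothetical event shaped like the schema above.
sample = {
    "type": "transcription",
    "is_final": True,
    "words": [
        {"word": "hi", "start": 0.0, "end": 0.2, "speaker": "0"},
        {"word": "there", "start": 0.2, "end": 0.5, "speaker": "0"},
        {"word": "hello", "start": 0.9, "end": 1.2, "speaker": "1"},
    ],
}
```

Calling `text_by_speaker(sample)` yields one joined string per speaker label, in order of first appearance.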