Start with the outcome you want to build. Realtime sessions are best for live audio that needs low latency. Request-based audio APIs are best for files, bounded requests, or generated speech that doesn’t need a live session.
Common use cases
| Goal | Model or API | Start here |
|---|---|---|
| Build a low-latency voice agent | gpt-realtime-2 | Voice agents |
| Translate live speech into another language | gpt-realtime-translate | Realtime translation |
| Transcribe live audio into streaming text | gpt-realtime-whisper | Realtime transcription |
| Transcribe files or bounded audio requests | Audio transcription models | Speech to text |
| Generate speech from text | Speech generation models | Text to speech |
| Add audio to an existing Chat Completions app | Audio-capable chat models | Audio and speech |
Choose a realtime session
Realtime sessions keep a connection open while your application sends audio, receives events, and updates session state.
| Session type | Use when | Endpoint or pattern |
|---|---|---|
| Voice-agent session | The model should respond to the user, call tools, and manage conversation state. | Conversation session on /v1/realtime |
| Translation session | The app should continuously translate speech as it arrives. | Continuous translation session on the dedicated translation endpoint |
| Transcription session | The app needs streaming transcript deltas without model-generated spoken responses. | Transcription session that emits transcript deltas |
Use a voice-agent session when your application needs an assistant that responds to the user. Use a translation session when your application needs an interpreter that translates the speaker. Use a transcription session when your application needs text from audio without model-generated responses.
Voice-agent sessions
Voice-agent sessions use the standard Realtime API conversation lifecycle. The client connects to /v1/realtime, sends audio or text, and listens for model responses, tool calls, and session events.
For most browser voice agents, start with the Voice agents guide. It uses the Agents SDK with WebRTC for browser audio and can connect to server-side tools.
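As a rough starting point, the sketch below wires up a browser voice agent with the Agents SDK's RealtimeAgent and RealtimeSession helpers. The /api/realtime-secret route and the shape of its response are assumptions standing in for whatever server endpoint mints your ephemeral client secret, and the model name is taken from the table above.

```ts
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

// The agent carries the instructions (and, later, tools) for the session.
const agent = new RealtimeAgent({
  name: 'Support agent',
  instructions: 'Answer briefly and confirm before taking any action.',
});

async function startVoiceAgent() {
  // Assumed backend route that creates an ephemeral client secret server-side.
  const { clientSecret } = await (await fetch('/api/realtime-secret')).json();

  const session = new RealtimeSession(agent, {
    model: 'gpt-realtime-2', // model name from the table above
  });

  // In the browser, connect() uses WebRTC by default and handles microphone
  // capture and audio playback for you.
  await session.connect({ apiKey: clientSecret });
}

startVoiceAgent().catch(console.error);
```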
Realtime 2 adds reasoning to speech-to-speech workflows. Start with reasoning.effort set to low for most production voice agents, then adjust based on latency tolerance and task complexity. Use the Realtime prompting guide to tune reasoning, preambles, tool use, unclear audio, and exact entity capture.
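To make the reasoning guidance concrete, here is a hedged sketch of a session.update sent from a trusted server over WebSocket. The reasoning field layout is an assumption based on the reasoning.effort setting named above, so verify it against the session reference for Realtime 2 before relying on it.

```ts
import WebSocket from 'ws';

// Server-side connection (Node.js with the ws package) to the Realtime API.
const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-realtime-2',
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
);

ws.on('open', () => {
  // Assumed field shape: start at low effort, and raise it only when task
  // complexity justifies the extra latency.
  ws.send(
    JSON.stringify({
      type: 'session.update',
      session: { reasoning: { effort: 'low' } },
    }),
  );
});
```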
Translation sessions
Realtime translation uses a dedicated translation endpoint instead of the standard voice-agent endpoint. Translation sessions are continuous: the client streams audio into the session, and the service streams translated audio and transcript deltas out.
Translation sessions don’t use the normal assistant turn lifecycle. Don’t call response.create, and don’t wait for the client to commit a user turn before translation begins. For browser media, use WebRTC. For server media pipelines such as phone calls or broadcast ingest, use WebSockets.
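A server-side sketch of that continuous pattern is below, using Node's ws package. The endpoint URL is read from configuration because the dedicated translation endpoint lives in the Realtime translation guide rather than here, and the delta check in the message handler is an assumption about event naming, not the guide's exact schema.

```ts
import WebSocket from 'ws';

// The dedicated translation endpoint is documented in the Realtime translation
// guide; it is not the standard /v1/realtime voice-agent endpoint.
const url = process.env.REALTIME_TRANSLATION_URL!;

const ws = new WebSocket(url, {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
});

// Call this for every audio chunk your media pipeline produces. The session is
// continuous: no response.create, no waiting to commit a user turn.
function sendAudioChunk(base64Pcm16: string) {
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Pcm16 }));
}

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());
  // Assumed naming: the session streams translated audio and transcript deltas
  // back as *.delta events; see the translation guide for the exact types.
  if (typeof event.type === 'string' && event.type.endsWith('.delta')) {
    console.log('translation delta:', event.type);
  }
});

export { sendAudioChunk };
```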
See Realtime translation for the dedicated endpoint, session configuration, and architecture patterns.
Transcription sessions
You can transcribe audio in more than one way. Use a realtime transcription session when your application needs live transcript deltas from streaming audio. Use the Speech to text guide for file uploads, request-based transcription, or diarization-focused workflows.
For realtime transcription, gpt-realtime-whisper gives you controllable latency. Lower delay settings produce earlier partial text, while higher delay settings can improve transcript quality. Test with your real audio conditions, target languages, accents, and domain vocabulary before choosing a production default.
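For orientation, a hedged transcription-session sketch follows, again server-side over WebSocket. The intent=transcription query parameter, the session configuration shape, and the transcript-delta event name are assumptions to check against the Realtime transcription guide; the delay setting described above is configured in the same place.

```ts
import WebSocket from 'ws';

// Transcription sessions stream audio in and emit transcript deltas out; they
// never produce model-generated spoken responses.
const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?intent=transcription', // confirm endpoint in the guide
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
);

ws.on('open', () => {
  ws.send(
    JSON.stringify({
      type: 'session.update',
      session: {
        // Assumed configuration shape: choose the model, then set the delay
        // trade-off (earlier partials vs. higher transcript quality) per the guide.
        input_audio_transcription: { model: 'gpt-realtime-whisper' },
      },
    }),
  );
});

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());
  // Assumed event name for streaming transcript deltas.
  if (event.type === 'conversation.item.input_audio_transcription.delta') {
    process.stdout.write(event.delta);
  }
});
```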
See Realtime transcription for session configuration and event handling.
Choose a connection method
Choose the transport based on where your application captures and plays audio:
- WebRTC: Use for browser and mobile clients that capture or play audio directly.
- WebSocket: Use when your server already receives raw audio from a media pipeline, call system, or worker.
- SIP: Use for telephony voice agents. Confirm model support before using SIP for translation or transcription.
Safety identifiers
If your application identifies individual end users, include a safety identifier with Realtime API requests. Safety identifiers are recommended but not required. They help OpenAI monitor and detect abuse while allowing enforcement to target an individual user rather than your entire organization. Use a stable, privacy-preserving value, such as a hashed internal user ID.
For Realtime API requests, send the identifier in the OpenAI-Safety-Identifier header. When using ephemeral tokens, set the header on the server-side request that creates the client secret so the identifier is bound to that session. When connecting from a trusted server with WebSocket or the unified WebRTC interface, set the header on the connection request.
Safety identifiers do not carry over from Responses API requests or from other sessions. If you use the Responses API safety_identifier parameter elsewhere in your application, pass the same stable value separately when you create or connect each Realtime session.
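As one way to apply this, the sketch below hashes an internal user ID and sets the header on the server-side request that creates the client secret. The client-secret endpoint path and request body follow the ephemeral-token pattern described above but should be confirmed against the current API reference, and SHA-256 hashing is just one reasonable way to produce a stable, privacy-preserving value.

```ts
import { createHash } from 'node:crypto';

// Derive a stable, privacy-preserving identifier from your internal user ID.
function safetyIdentifier(internalUserId: string): string {
  return createHash('sha256').update(internalUserId).digest('hex');
}

// Server-side request that mints the ephemeral client secret. Setting the
// header here binds the safety identifier to the session the browser opens
// with that secret.
async function createClientSecret(internalUserId: string) {
  const response = await fetch('https://api.openai.com/v1/realtime/client_secrets', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
      'OpenAI-Safety-Identifier': safetyIdentifier(internalUserId),
    },
    body: JSON.stringify({ session: { type: 'realtime', model: 'gpt-realtime-2' } }),
  });
  return response.json();
}

export { createClientSecret };
```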
Related guides
- Realtime prompting guide: Prompt and tune Realtime voice models.
- Managing conversations: Work with the Realtime session lifecycle.
- Realtime translation: Translate live speech with a dedicated translation session.
- Realtime transcription: Stream live transcript deltas from audio.
- Realtime with tools: Connect function tools, MCP servers, and connectors to a Realtime session.
- Webhooks and server-side controls: Control Realtime sessions from your server.
- Managing costs: Track and optimize Realtime API usage.
Use Audio and speech for the core concepts behind audio input, audio output, streaming, latency, transcripts, and speech generation. Use this overview when you are ready to choose an implementation path.