A small audio model launch --
gpt-4o-transcribe-diarize
This is a diarization-focused ASR model. It's big and slow, so we recommend running it offline, but it excels at differentiating speakers, and you can provide voice samples for known speakers up front.
I joined OpenAI at the beginning of the year -- partly because I was excited about the possibility of better voice interaction with computers. So it was *especially* amazing to work with the team here on the gpt-4o model launch.
It's hard to grok until you try it how big of a
I can’t overemphasize how good the new realtime speech2speech model is at function calling. It is fast and accurate with native audio input. It exceeded my expectations and those of the post-training researchers.
This one —
gpt-4o-realtime-preview-2025-06-03
Heads up -- we're shifting the dateless Realtime API model name, gpt-4o-realtime-preview, to point to gpt-4o-realtime-preview-2024-12-17. This model has some valuable improvements, so if you use the dateless model, things should get magically better.
A new feature for the Realtime API -- you can now set "language" and "prompt" for input audio transcription. Lots of users requested this, and it should make a big difference if you rely on transcription accuracy and know the language or expected keywords.
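For anyone wiring this up, here is a rough sketch of what the session update might look like. The "language" and "prompt" fields are the ones named above; the `session.update` event type, the `input_audio_transcription` key, and the `whisper-1` default are assumptions based on my reading of the Realtime API reference, and `build_transcription_update` is a hypothetical helper, so check the current docs before relying on any of it.

```python
import json


def build_transcription_update(language, prompt, model="whisper-1"):
    """Build a session update that configures input audio transcription.

    Assumption: the Realtime API accepts a `session.update` event with an
    `input_audio_transcription` object. The transcription model name is
    also an assumption; substitute whichever one you actually use.
    """
    return {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {
                "model": model,
                "language": language,  # code for the language you expect, e.g. "en"
                "prompt": prompt,      # free text hinting at expected keywords
            }
        },
    }


event = build_transcription_update(
    language="en",
    prompt="Expected keywords: Realtime API, diarization, Semantic VAD",
)

# Serialize to JSON; this string is what you would send over the
# WebSocket (or data channel) of an already-open Realtime session.
payload = json.dumps(event)
```

The helper only builds and serializes the event; sending `payload` requires a connected session, which is left out here.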
New feature launching today on the Realtime API:
🟤Semantic VAD🟤.
This is a custom turn detection model that uses the *content* of speech to tell if the user is done speaking. This is a huge improvement for cases where the user pauses and the model incorrectly interrupts.
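A minimal sketch of switching a session over to it, with assumptions flagged: the `turn_detection` key, the `semantic_vad` type string, and the `eagerness` setting with its allowed values reflect my understanding of the Realtime API reference rather than anything stated in this post, and `build_semantic_vad_update` is a hypothetical helper.

```python
import json

# Assumption: these are the eagerness values the API accepts.
ALLOWED_EAGERNESS = {"low", "medium", "high", "auto"}


def build_semantic_vad_update(eagerness="auto"):
    """Build a session update that selects semantic turn detection.

    Assumption: turn detection is configured through a `session.update`
    event whose `turn_detection.type` is "semantic_vad".
    """
    if eagerness not in ALLOWED_EAGERNESS:
        raise ValueError(f"unsupported eagerness: {eagerness!r}")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                # Assumed meaning: lower eagerness waits longer before
                # deciding the user has finished their turn.
                "eagerness": eagerness,
            }
        },
    }


vad_event = build_semantic_vad_update(eagerness="low")
vad_payload = json.dumps(vad_event)  # send this over an open Realtime session
```

If the model still cuts in on users who pause to think, a lower eagerness value is the first thing I would try, assuming the setting behaves as described in the comments.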