OpenAI expanded its real-time capabilities on 8 May with the simultaneous launch of three production-ready audio APIs, each targeting a different slice of the voice AI market.
GPT-Realtime-2 is the flagship: a voice-native reasoning model that processes speech directly without intermediate text conversion, enabling more natural conversational AI applications. Unlike previous voice APIs that converted speech to text, processed it through a language model, and then synthesised audio output, Realtime-2 maintains an end-to-end audio pipeline that preserves tone, emotion, and conversational context.
Realtime-Translate supports real-time translation across more than 70 languages, targeting applications from customer support to live event interpretation. The model also handles code-switching (conversations that move between languages mid-sentence), which has historically been a weak point for translation APIs.
Realtime-Whisper is the production-grade evolution of OpenAI's Whisper transcription model, optimised for live streaming scenarios with sub-second latency. It includes speaker diarisation (identifying who said what in multi-speaker conversations), punctuation and formatting, and domain-specific vocabulary handling.
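To make the diarisation feature concrete, here is a minimal sketch of how an application might turn speaker-labelled segments into a readable transcript. The `Segment` shape and the `to_turns` helper are assumptions made for illustration; the actual response format of Realtime-Whisper is not described here and may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical diarised segment; not the real API response shape."""
    speaker: str   # diarisation label such as "S1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # punctuated text for this segment


def to_turns(segments: Iterable[Segment]) -> List[str]:
    """Merge consecutive segments from the same speaker into one line each."""
    turns: List[List[str]] = []  # each entry is [speaker, accumulated text]
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1][1] += " " + seg.text
        else:
            # Speaker changed: start a new turn.
            turns.append([seg.speaker, seg.text])
    return [f"{speaker}: {text}" for speaker, text in turns]


example = [
    Segment("S1", 0.0, 0.8, "Hello"),
    Segment("S1", 0.8, 1.4, "there."),
    Segment("S2", 1.6, 2.0, "Hi."),
]
print(to_turns(example))
```

Sorting by start time before merging keeps the output stable even if segments arrive out of order, which can happen in a streaming setting.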
For context engineers building voice-enabled applications, the three APIs form a complete audio stack: transcribe with Realtime-Whisper, reason with GPT-Realtime-2, and translate with Realtime-Translate. Developers can combine all three or use each on its own. Pricing follows OpenAI's standard per-token structure, with audio tokens priced at a premium over text tokens. The launch positions OpenAI against ElevenLabs (voice synthesis), Deepgram (transcription), and Google's Gemini Live (multimodal conversation) in the rapidly growing voice AI market.
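One way to picture the mix-and-match design is as a stack of optional stages. The sketch below is an illustration under stated assumptions: `VoiceStack`, the three function types, and the `fake_*` stand-ins are all hypothetical names, and none of them corresponds to an actual OpenAI SDK call. In a real application each stage would wrap a client for the relevant API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical stage signatures; these are not OpenAI SDK functions.
TranscribeFn = Callable[[bytes], str]         # audio -> transcript text
ReasonFn = Callable[[bytes], bytes]           # audio -> spoken reply audio
TranslateFn = Callable[[bytes, str], bytes]   # audio, target language -> audio


@dataclass
class VoiceStack:
    """A stack of optional stages; any subset can be configured."""
    transcribe: Optional[TranscribeFn] = None
    reason: Optional[ReasonFn] = None
    translate: Optional[TranslateFn] = None

    def handle(self, audio: bytes,
               target_language: Optional[str] = None) -> Dict[str, object]:
        result: Dict[str, object] = {}
        if self.transcribe is not None:
            result["transcript"] = self.transcribe(audio)
        # Without a reasoning stage the input audio passes through unchanged.
        reply = self.reason(audio) if self.reason is not None else audio
        if self.translate is not None and target_language is not None:
            reply = self.translate(reply, target_language)
        result["audio"] = reply
        return result


# Stand-in stages so the sketch runs without network access.
def fake_transcribe(audio: bytes) -> str:
    return f"<{len(audio)} bytes of speech>"


def fake_reason(audio: bytes) -> bytes:
    return b"reply:" + audio


def fake_translate(audio: bytes, target_language: str) -> bytes:
    return target_language.encode() + b":" + audio


full = VoiceStack(fake_transcribe, fake_reason, fake_translate)
transcription_only = VoiceStack(transcribe=fake_transcribe)

print(full.handle(b"hi", target_language="fr"))
print(transcription_only.handle(b"hi"))
```

Keeping every stage optional mirrors the claim that the APIs can be used together or on their own: a transcription-only product configures one stage, while a multilingual voice agent configures all three.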