OpenAI Realtime Voice Models in 2026: GPT-Realtime-2, Live Translation, and Streaming Transcription
On May 7, 2026, OpenAI announced a new set of realtime voice models in its post Advancing voice intelligence with new models in the API. For the wider agentic ecosystem, this is one of the clearest signs that voice is moving from a chatbot feature into a full agent interface.
The release introduces three models:
- GPT-Realtime-2 for live voice interactions with GPT-5-class reasoning
- GPT-Realtime-Translate for live speech translation
- GPT-Realtime-Whisper for low-latency streaming transcription
If you build with OpenClaw or similar agent frameworks, this matters because the official OpenAI release explicitly frames voice as a path from simple conversation toward systems that can listen, reason, translate, transcribe, and take action while the conversation is still happening.
What OpenAI Actually Announced
According to the official release, GPT-Realtime-2 is OpenAI’s first voice model with GPT-5-class reasoning. OpenAI says it is designed for live interactions where the model can keep the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment.
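The release is an announcement, not an API reference, so the wire details below are carried over from the existing Realtime API as assumptions. Here is a minimal sketch of opening a session with the new model over WebSocket; the model id "gpt-realtime-2" is inferred from the post and may not be exact.

```python
# Minimal sketch: open a Realtime API session with the new model.
# Assumptions: the endpoint, headers, and session.update event follow the
# current Realtime API; "gpt-realtime-2" is inferred from the post.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    # "additional_headers" is the websockets >= 14 parameter name; older
    # versions call the same parameter "extra_headers".
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Ask for spoken output and let the server detect turn boundaries,
        # so the model can handle interruptions mid-response.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "turn_detection": {"type": "server_vad"},
            },
        }))
        async for raw in ws:
            print(json.loads(raw)["type"])  # server events as they stream


asyncio.run(main())
```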
The post also lists several concrete changes that are relevant to agent builders:
- short preambles such as “let me check that”
- parallel tool calls with audible transparency
- stronger recovery behavior when something goes wrong
- a context window increase from 32K to 128K
- adjustable reasoning effort, from minimal through xhigh
That combination is directly useful for anyone building voice-first automation or support agents.
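Taken together, the tool-calling and reasoning-effort items suggest session configuration along these lines. This is a sketch that reuses today's Realtime function-calling schema; check_order_status is a hypothetical tool, and "reasoning_effort" is an assumed field name, since the post does not document one.

```python
# Sketch: declare a tool and handle the model calling it mid-conversation.
# Assumptions: the tool schema and event names mirror today's Realtime API;
# "check_order_status" is a hypothetical tool and "reasoning_effort" is an
# assumed field name for the minimal..xhigh setting described in the post.
import json

SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "reasoning_effort": "high",  # assumed name; range is minimal..xhigh
        "tool_choice": "auto",
        "tools": [{
            "type": "function",
            "name": "check_order_status",
            "description": "Look up the status of a customer order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }],
    },
}


async def handle_event(event: dict, ws) -> None:
    # With parallel tool calls, several function_call items can complete
    # while the model keeps the spoken turn going.
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = {"order_id": args["order_id"], "status": "shipped"}  # stub
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        # Ask the model to continue speaking with the tool result in hand.
        await ws.send(json.dumps({"type": "response.create"}))
```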
The same release introduced GPT-Realtime-Translate, which OpenAI says supports more than 70 input languages and 13 output languages, and GPT-Realtime-Whisper, a new streaming speech-to-text model for live transcription.
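For the transcription side, here is a hedged sketch of what a streaming session could look like, reusing the transcription-session shape and delta events from the current Realtime API. Only the model id comes from the post; the rest is assumption.

```python
# Sketch: configure live transcription and print partial results.
# Assumptions: the transcription-session shape and event names follow the
# current Realtime API; "gpt-realtime-whisper" is the id from the post.
TRANSCRIPTION_SESSION = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {"model": "gpt-realtime-whisper"},
        "turn_detection": {"type": "server_vad"},
    },
}


def on_event(event: dict) -> None:
    kind = event["type"]
    if kind == "conversation.item.input_audio_transcription.delta":
        print(event["delta"], end="", flush=True)  # partial text, low latency
    elif kind == "conversation.item.input_audio_transcription.completed":
        print()  # final text for this audio segment has arrived
```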
Why This Matters for the Agentic Ecosystem
In practical terms, OpenAI is moving voice closer to the same operating model already common in text-based agents: keep context, use tools, recover from interruptions, and work across long multi-step tasks.
That lines up with where the broader ecosystem is already heading. OpenClaw users tracking voice and automation will see the overlap immediately with posts like OpenClaw 2026.5.4, where OpenClaw improved Google Meet voice handling, and Mastering Multi-Step Browser Automation for AI Agents, where the emphasis is on agents completing real workflows rather than just answering prompts.
The bigger pattern is this: voice is becoming another agent runtime, not a separate product category.
Pricing and Availability
OpenAI says all three models are available in the Realtime API. The pricing listed in the official release is:
- GPT-Realtime-2: $32 / 1M audio input tokens and $64 / 1M audio output tokens, with cached input tokens at $0.40 / 1M
- GPT-Realtime-Translate: $0.034 per minute
- GPT-Realtime-Whisper: $0.017 per minute
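To make the rates concrete, here is a quick back-of-envelope estimate. The token counts are placeholders, not measurements; audio token usage per minute varies and the post does not publish a conversion.

```python
# Back-of-envelope cost math at the listed rates.
AUDIO_IN_PER_1M = 32.00    # GPT-Realtime-2 audio input, $ per 1M tokens
AUDIO_OUT_PER_1M = 64.00   # GPT-Realtime-2 audio output, $ per 1M tokens
WHISPER_PER_MIN = 0.017    # GPT-Realtime-Whisper, $ per minute


def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6 * AUDIO_IN_PER_1M
            + output_tokens / 1e6 * AUDIO_OUT_PER_1M)


# Example: a session that consumes 50k input and 20k output audio tokens,
# plus ten minutes of side-channel transcription.
session = realtime2_cost(50_000, 20_000) + 10 * WHISPER_PER_MIN
print(f"${session:.2f}")  # $3.05
```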
The same post also says the Realtime API supports EU Data Residency for EU-based applications and is covered by OpenAI’s enterprise privacy commitments.
What to Watch Next
This launch does not mean every agent should become voice-first. It does mean the underlying platform pieces are becoming mature enough for voice to be treated as a serious orchestration layer.
For OpenClaw-style builders, the most important questions now are:
- when should voice trigger tools instead of just returning speech
- how should approvals work in spoken workflows (see the sketch below)
- which jobs benefit from translation or live transcription in the loop
Those questions connect naturally to The Collaborative Frontier: Humans-in-the-Loop as an Architectural First-Class Citizen and Deterministic AI Workflows, where reliability and explicit control matter more than novelty.
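On the approvals question specifically, the gating logic can stay small. Here is a minimal, entirely hypothetical sketch: the tool names and the confirmation mechanism are illustrative, not anything OpenAI or OpenClaw ships.

```python
# Sketch: gate risky voice-triggered tools behind an explicit confirmation.
# Everything here is illustrative; no real OpenClaw or OpenAI API is used.
from typing import Callable

RISKY_TOOLS = {"issue_refund", "delete_record"}  # hypothetical examples


def run_tool(name: str, args: dict) -> dict:
    # Stand-in dispatcher; a real agent would route to its tool registry.
    return {"status": "ok", "tool": name, "args": args}


def gate_tool_call(name: str, args: dict, confirm: Callable[[str], bool]) -> dict:
    """Run safe tools directly; pause risky ones until the user confirms,
    out loud or in a UI, exactly what is about to happen."""
    if name in RISKY_TOOLS and not confirm(f"About to run {name} with {args}. Proceed?"):
        return {"status": "cancelled_by_user"}
    return run_tool(name, args)


# Example: wire the confirmation to a console prompt for testing.
if __name__ == "__main__":
    ask = lambda msg: input(f"{msg} [y/N] ").strip().lower() == "y"
    print(gate_tool_call("issue_refund", {"order_id": "A123"}, ask))
```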