
OpenAI Ships GPT-Realtime-2: Voice Agents Get GPT-5 Reasoning

OpenAI has officially moved voice agents past the “fast chatbot” phase and into the realm of active reasoning. On May 7, 2026, the company released a trio of new models—GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper—designed to replace the brittle, multi-stack pipelines developers have been hacking together for the last two years.

The headline act is GPT-Realtime-2, which OpenAI describes as its first voice model with “GPT-5-class reasoning.” This isn’t just a speed bump; it’s a structural shift toward native speech-to-speech architecture that eliminates the latency and context loss inherent in traditional transcribe-then-reason-then-speak pipelines (Inworld).

The New Voice Stack: Three Specialized Models

Rather than forcing a single model to handle every audio task, OpenAI is unbundling the Realtime API into specialized primitives. This allows developers to optimize for cost and latency depending on whether they need a full agent or just a data pipe.

  1. GPT-Realtime-2 (The Flagship): A full conversational agent. It listens, reasons, calls tools, and talks back. It supports configurable reasoning effort, allowing you to trade off latency for higher-quality logic in complex workflows (OpenAI Docs).
  2. GPT-Realtime-Translate: A dedicated translation pipe supporting 70+ input languages and 13 output languages. It is designed for simultaneous interpretation where the model keeps pace with the speaker (DataCamp).
  3. GPT-Realtime-Whisper: A streaming speech-to-text model. Unlike the original Whisper, which processed audio in chunks, this version streams text deltas as the speaker talks, making it the new standard for live captions (OpenAI Announcement).
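The streaming-deltas design of GPT-Realtime-Whisper means a caption client just concatenates fragments as they arrive. A minimal sketch of that consumer loop, assuming illustrative event names (“transcript.delta”, “transcript.done”) rather than confirmed Realtime API field names:

```python
# Minimal sketch: assembling a live transcript from streamed text deltas.
# The event names ("transcript.delta", "transcript.done") are illustrative
# assumptions, not confirmed Realtime API field names.

def assemble_transcript(events):
    """Accumulate streamed text deltas into a running transcript."""
    transcript = []
    for event in events:
        if event["type"] == "transcript.delta":
            transcript.append(event["text"])
        elif event["type"] == "transcript.done":
            break
    return "".join(transcript)

# Simulated event stream, standing in for a live WebSocket feed.
events = [
    {"type": "transcript.delta", "text": "Hello"},
    {"type": "transcript.delta", "text": ", world"},
    {"type": "transcript.done"},
]
print(assemble_transcript(events))  # -> Hello, world
```

In a real client, each delta would be flushed to the caption display as it arrives rather than buffered to the end.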

Technical Specs and Performance Gains

The jump from the previous gpt-realtime-1.5 (released in February 2026) to version 2 is significant, particularly for enterprise-grade reliability. According to internal benchmarks, instruction following has improved by approximately 14 points, while alphanumeric transcription accuracy is up by 10.23% (Oflight).

| Feature | GPT-Realtime-1.5 | GPT-Realtime-2 |
| --- | --- | --- |
| Reasoning class | GPT-4o | GPT-5 |
| Context window | 32,000 tokens | 128,000 tokens |
| Max output | 4,096 tokens | 32,000 tokens |
| Reasoning effort | Fixed | Configurable (low to x-high) |
| Tool use | Standard | Reliable / multi-step |
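To put the 128K window in perspective, a back-of-the-envelope conversion into minutes of speech. The tokens-per-second figure below is an illustrative assumption, not a documented rate for GPT-Realtime-2:

```python
# Back-of-the-envelope: how much conversation fits in the 128K window?
# Assumes ~10 audio input tokens per second of speech -- an illustrative
# assumption, not a published conversion rate.

AUDIO_TOKENS_PER_SECOND = 10   # assumption
CONTEXT_WINDOW = 128_000       # tokens, from the spec table above

seconds = CONTEXT_WINDOW / AUDIO_TOKENS_PER_SECOND
minutes = seconds / 60
print(f"~{minutes:.0f} minutes of audio")
```

Even with generous headroom reserved for system prompts and tool output, tens of minutes of conversation fit comfortably under these assumptions.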

Pricing and Modality Economics

Pricing remains the primary hurdle for high-volume production. While text tokens are relatively cheap, audio tokens carry a premium that reflects the compute-heavy nature of native multimodal reasoning. However, the introduction of gpt-realtime-mini and aggressive caching discounts offer a path for builders on a budget (OpenAI Pricing).

Standard GPT-Realtime-2 Rates:

  • Audio Input: $32.00 / 1M tokens ($0.40 if cached)
  • Audio Output: $64.00 / 1M tokens
  • Text Input: $4.00 / 1M tokens ($0.40 if cached)
  • Text Output: $24.00 / 1M tokens

For specialized tasks, the per-minute pricing of the Translate and Whisper models is often more predictable for business logic:

  • GPT-Realtime-Translate: $0.034 per minute
  • GPT-Realtime-Whisper: $0.017 per minute
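To compare the token-billed flagship against the per-minute models, you need a tokens-per-minute conversion. The figure used below is an illustrative assumption, not a published rate; only the dollar prices come from the article:

```python
# Rough cost comparison: token-billed GPT-Realtime-2 audio vs. the
# per-minute Translate/Whisper rates. TOKENS_PER_MIN is an illustrative
# assumption, not a published conversion rate.

AUDIO_IN_PER_M = 32.00    # $ per 1M audio input tokens (uncached)
AUDIO_OUT_PER_M = 64.00   # $ per 1M audio output tokens
TOKENS_PER_MIN = 600      # assumed audio tokens per minute of speech

def realtime2_cost_per_min(in_tok=TOKENS_PER_MIN, out_tok=TOKENS_PER_MIN):
    """Estimated dollars per minute for a symmetric audio conversation."""
    return (in_tok * AUDIO_IN_PER_M + out_tok * AUDIO_OUT_PER_M) / 1_000_000

print(f"GPT-Realtime-2: ${realtime2_cost_per_min():.4f}/min")  # -> $0.0576/min
print("Translate:      $0.0340/min")
print("Whisper:        $0.0170/min")
```

Under these assumptions the flagship runs roughly 2–3x the per-minute specialists, which is why routing pure transcription or translation traffic to the dedicated models is usually the cheaper architecture.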

Competitive Landscape: Monoliths vs. Modular Stacks

The market is currently split between the “OpenAI Monolith” and modular stacks. Critics on Hacker News and Reddit often point out that a custom stack—using Deepgram for STT, a flagship LLM for reasoning, and Cartesia or ElevenLabs for TTS—can sometimes achieve lower latency (sub-100ms) than OpenAI’s unified 250ms+ response time (Sentiment Scan).

However, the unified architecture of GPT-Realtime-2 has a distinct advantage: it understands prosody. Because it is audio-to-audio, it can detect sarcasm, hesitation, and emotional shifts that are lost when audio is flattened into text. For developers building “human-like” assistants, this nuance is often worth the premium.

Takeaways for Builders

  • Context is King: The move to a 128K context window means your voice agents can finally remember the last 20 minutes of a conversation without losing the thread or hallucinating user details.
  • Dial Your Reasoning: Use the reasoning_effort parameter. Set it to ‘low’ for simple greetings to save on latency/cost, and ‘high’ only when the agent needs to navigate a complex API or solve a logic puzzle.
  • Translation is Now a Commodity: At $0.034/minute, real-time translation is cheap enough to embed in almost any cross-border support tool or travel app.
  • Watch the Token Bloat: Native voice models tend to be “chatty.” If you don’t prompt for brevity, the model will use filler words (um, ah, well) that sound natural but eat into your output token budget (Hacker News).
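The “dial your reasoning” advice can be sketched as a per-turn router. The session.update payload shape below is an assumption modeled on the Realtime API’s event style; only the reasoning_effort parameter name comes from the article:

```python
# Sketch: choosing a reasoning effort per turn. The session.update payload
# shape is an assumption; only the reasoning_effort parameter name comes
# from the article's description of GPT-Realtime-2.
import json

def session_update(effort: str) -> str:
    """Build a session.update event selecting a reasoning effort level."""
    assert effort in ("low", "medium", "high", "x-high")
    return json.dumps({
        "type": "session.update",
        "session": {"reasoning_effort": effort},
    })

def pick_effort(user_turn: str) -> str:
    """Crude routing: cheap effort for greetings, high for everything else."""
    greetings = ("hi", "hello", "hey", "thanks")
    return "low" if user_turn.lower().strip(" !.") in greetings else "high"

print(pick_effort("Hello!"))                                  # -> low
print(pick_effort("Refund order #4521 via the billing API"))  # -> high
```

A production router would classify intent with a cheap model rather than a keyword list, but the shape is the same: decide the effort before the turn, then send the update.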
