OpenAI Realtime API: What has happened in the year since the beta

The general availability release improves audio quality, speech handling, debugging, and the developer experience. A more cost-effective mini variant is also available.

By Marius Obert

Nearly a year after introducing the developer preview, OpenAI released the general availability (GA) version of the Realtime API in August 2025. The Realtime API is a multimodal interface that exchanges audio and text data directly with a language model at very low latency. OpenAI's DevDay in October 2025, a few months after the GA launch, brought additional innovations, including new tools, price changes, and a smaller, faster model variant.

Marius Obert enjoys building prototypes with the latest cloud technologies and loves talking about them even more. He started his career in UI development in sunny California. During this time, he learned to love web technologies such as JavaScript in general and the entire Node.js ecosystem in particular.

With the GA version, OpenAI significantly expands the possibilities for interacting with AI agents. In a blog post, the company presents application examples from partners such as Zillow, T-Mobile, StubHub, Oscar Health, and Lemonade, which illustrate the diversity of use cases. The fields of application go beyond classic voice dialogues and enable so-called "hands-free interactions," where users can flexibly combine text, voice, and visual inputs. The Realtime API is not designed exclusively as a voice-to-voice solution, but as a multimodal system that accepts text, audio, and images equally as input. Voice interaction is thus one communication channel that complements the other forms of use.

Why Realtime APIs are Relevant

Applications are becoming increasingly interactive, and users expect immediate, natural responses. The Realtime API fulfills this need by enabling continuous, bidirectional communication with very low latency – for example, for voice assistance in customer support, automatic note-takers in the office, or applications that combine live visualizations and voice.

Because the traditional intermediate steps, such as separate speech-to-text and text-to-speech stages, are eliminated, a single model understands and answers speech directly, without perceptible delay or loss of nuance.

Compared to the Developer Preview, the GA version of the Realtime API contains numerous technical extensions and improvements in the areas of model architecture, integration, and usability. A central change is the introduction of a mini variant of the model, which enables more cost-effective and faster applications. The OpenAI website shows the differences between gpt-realtime and gpt-realtime-mini.
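Switching between the two variants mostly comes down to the model name passed when opening a session. The following Python sketch shows how a WebSocket connection URL and authorization header could be assembled for either model. The endpoint and the `model` query parameter reflect my understanding of the OpenAI documentation and should be verified there; the helper names are hypothetical, and the sketch deliberately does not open a connection.

```python
from urllib.parse import urlencode

# Assumption: WebSocket endpoint as described in the OpenAI Realtime docs.
REALTIME_WS_BASE = "wss://api.openai.com/v1/realtime"

# The two model variants named in the article.
KNOWN_MODELS = ("gpt-realtime", "gpt-realtime-mini")


def realtime_ws_url(model: str) -> str:
    """Return the WebSocket URL for the given Realtime model (hypothetical helper)."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return f"{REALTIME_WS_BASE}?{urlencode({'model': model})}"


def auth_headers(api_key: str) -> dict[str, str]:
    """Return the bearer-token header to send with the WebSocket handshake."""
    return {"Authorization": f"Bearer {api_key}"}


if __name__ == "__main__":
    # Use the cheaper mini variant; swap the name to target the full model.
    print(realtime_ws_url("gpt-realtime-mini"))
```

A WebSocket client of your choice would then connect to this URL with the returned headers.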

The audio quality has been significantly revised: the generated speech sounds more natural and expressive, with finer intonations, smoother pauses, and better adaptation to conversation flows. OpenAI has introduced two new voices for this: "Cedar" and "Marin."

OpenAI has also improved the model's ability to follow complex instructions. It responds more precisely to system and developer prompts, can read out texts verbatim, reproduces alphanumeric sequences correctly, and switches fluently between languages. Benchmarks such as the Big Bench Audio evaluation indicate an increase in accuracy from around 65 percent in the beta version to over 82 percent in the GA version:

OpenAI Realtime API: Results of the Big Bench Audio Intelligence Benchmark

(Image: OpenAI)

Another significant improvement concerns the model's conversational capabilities. The Realtime API can now make conversations smoother, more natural, and more context-aware by better interpreting pauses, intonation, and conversational dynamics. In this context, OpenAI introduces conversation idle timeouts. If the model detects no input over a defined period, it can automatically output a follow-up such as "Are you still there?" to keep the conversation going and signal to the user that the session is still active.
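As a rough illustration of how such an idle timeout might be configured, the following Python sketch builds a `session.update` event. The field names (`audio.input.turn_detection` with `type: "server_vad"` and `idle_timeout_ms`) follow my reading of the GA API reference and are assumptions to check against the current documentation; the function name and the 6,000 ms value are purely illustrative.

```python
import json


def build_idle_timeout_update(idle_timeout_ms: int) -> dict:
    """Build a session.update event that enables an idle timeout.

    Assumption: the timeout lives under server-side voice activity
    detection (turn_detection) as `idle_timeout_ms`.
    """
    if idle_timeout_ms <= 0:
        raise ValueError("idle_timeout_ms must be positive")
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "server_vad",
                        # Silence (in ms) after which the model may follow up.
                        "idle_timeout_ms": idle_timeout_ms,
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Illustrative value: prompt the user after six seconds of silence.
    print(json.dumps(build_idle_timeout_update(6000), indent=2))
```

The resulting JSON would be sent over an already open Realtime connection.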

Furthermore, during longer or asynchronous function calls, the model responds with interim remarks such as "I'm still waiting for the result" to bridge waiting times and keep the dialogue going. These additions make the interaction more natural and consistent, especially in use cases where real-time feedback and spoken interim output are crucial for the user experience.

Alongside WebSocket and WebRTC, the OpenAI Realtime API now also supports the Session Initiation Protocol (SIP), which facilitates direct integration into telephony and contact center systems.

For developers, OpenAI has revised the structure of events and message items to simplify debugging and error handling. The model is now also available with EU data residency, which helps to meet European data protection requirements.

Pre-built tools such as web search or Code Interpreter are not yet integrated, so developers have to build equivalents themselves. However, external tools can be integrated into the agent logic by connecting MCP servers (Model Context Protocol). This allows the capabilities of an agent to be extended relatively easily and tied into the existing tools of larger applications. Together, these changes increase the robustness, flexibility, and practical applicability of the API.
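To make the MCP integration more concrete, here is a minimal Python sketch that builds a `session.update` event registering a remote MCP server as a tool. The tool fields (`type: "mcp"`, `server_label`, `server_url`, `require_approval`) are based on my understanding of OpenAI's MCP tool configuration and should be treated as assumptions; the label and the `https://example.com/mcp` URL are hypothetical placeholders.

```python
import json


def mcp_tool(server_label: str, server_url: str,
             require_approval: str = "never") -> dict:
    """Describe a remote MCP server as a tool entry (assumed field names)."""
    return {
        "type": "mcp",
        "server_label": server_label,
        "server_url": server_url,
        # Assumption: "never" lets the model call tools without approval.
        "require_approval": require_approval,
    }


def session_update_with_tools(tools: list[dict]) -> dict:
    """Wrap tool entries in a session.update event."""
    return {
        "type": "session.update",
        "session": {"type": "realtime", "tools": list(tools)},
    }


if __name__ == "__main__":
    # Hypothetical MCP server; replace with a real endpoint.
    tool = mcp_tool("ticketing", "https://example.com/mcp")
    print(json.dumps(session_update_with_tools([tool]), indent=2))
```

Whether automatic approval is appropriate depends on how much you trust the MCP server; a stricter setting is the safer default for third-party servers.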

The following table provides an overview of the key model parameters and pricing structures for gpt-realtime and gpt-realtime-mini.

Model | Context window (tokens) | Max output (tokens) | Knowledge cutoff | Input types | Output types | Price per 1 million input tokens (audio) | Price per 1 million output tokens (audio)
--- | --- | --- | --- | --- | --- | --- | ---
gpt-realtime | 32,000 | 4,096 | Oct. 2023 | Text, image, audio | Text, audio | $32.00 (cached: $0.40) | $64.00
gpt-realtime-mini | 32,000 | 4,096 | Oct. 2023 | Text, image, audio | Text, audio | $10.00 (cached: $0.30) | $20.00

Table 1: Comparison of the core parameters of both models; cost per million tokens in US dollars ($)
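To get a feeling for what these prices mean in practice, the following Python sketch estimates the audio cost of a session from token counts. The prices are taken from Table 1; the token counts in the example are hypothetical, and text or image tokens, which may be billed at different rates, are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioPricing:
    """Audio prices in US dollars per 1 million tokens (from Table 1)."""
    input: float
    cached_input: float
    output: float


PRICING = {
    "gpt-realtime": AudioPricing(input=32.00, cached_input=0.40, output=64.00),
    "gpt-realtime-mini": AudioPricing(input=10.00, cached_input=0.30, output=20.00),
}


def audio_cost_usd(model: str, input_tokens: int, output_tokens: int,
                   cached_input_tokens: int = 0) -> float:
    """Estimate the audio cost; cached tokens are a subset of input tokens."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    price = PRICING[model]
    fresh_input = input_tokens - cached_input_tokens
    total = (fresh_input * price.input
             + cached_input_tokens * price.cached_input
             + output_tokens * price.output)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical session: 100,000 audio input and 50,000 audio output tokens.
    for model in PRICING:
        print(f"{model}: ${audio_cost_usd(model, 100_000, 50_000):.2f}")
```

For this hypothetical session, the full model comes to $6.40 and the mini variant to $2.00, which matches the roughly threefold price difference in the table.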

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.