Medieval to true crime: New speech models available in OpenAI API
OpenAI offers new text-to-speech and speech-to-text models in the API. These are designed to outperform Whisper.

(Image: Shutterstock/ioda)
The knight from the Middle Ages recites a text in the style of a ballad – "may the quest be delicious". A smoky male voice is said to be particularly suitable for reciting true crime stories. The bedtime story is, of course, in a gentle female voice. And the surfer starts with the words: "Wow Dude." OpenAI has published these audio examples. However, the new text-to-speech model is not limited to them. In future, developers will be able to instruct the model to speak in a specific way that they describe.
The text-to-speech and speech-to-text models are available in the API. According to OpenAI's blog post, they have been significantly improved, for example in the word error rate of the transcription model. The new option to specify the tone of voice should make it possible, for example, to set up an "empathetic customer service employee".
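How such a tone instruction might be passed to the API can be sketched in a few lines of Python. This is a minimal sketch, not OpenAI's reference code: it assumes the speech endpoint at https://api.openai.com/v1/audio/speech, the model name gpt-4o-mini-tts, the voice "coral" and an "instructions" field in the request body. The helper names build_speech_request and synthesize are hypothetical and chosen only for illustration.

```python
import json
import urllib.request

# Assumed endpoint for text-to-speech requests.
API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, instructions, voice="coral", model="gpt-4o-mini-tts"):
    """Build the JSON payload: the text to speak plus a free-form style description."""
    return {
        "model": model,                # assumed model name
        "input": text,                 # the text to be spoken
        "voice": voice,                # assumed preset voice
        "instructions": instructions,  # how the model should speak
    }


def synthesize(payload, api_key, out_path="speech.mp3"):
    """Send the payload and write the returned audio bytes to out_path."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as audio_file:
        audio_file.write(response.read())
    return out_path


# Example payload for the "empathetic customer service employee" use case.
payload = build_speech_request(
    text="I'm sorry about the trouble with your order. Let's sort it out together.",
    instructions="Speak like an empathetic customer service employee: calm, warm, unhurried.",
)
```

Calling synthesize(payload, api_key) with a valid API key would send the request and save the audio file. Building the payload separately makes it possible to check the style description without touching the network.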
Better than Whisper and more cost-effective
The new audio models are based on GPT-4o and GPT-4o mini and are designed to be more cost-efficient than previous versions, not least thanks to improved model distillation, i.e. the transfer of knowledge from a large model to a smaller, more efficient one. The models were also trained separately on audio data. According to OpenAI, the speech-to-text model even outperforms Whisper, the company's previous transcription model. This is said to be due to the integration of reinforcement learning, i.e. training in which a model is rewarded for desired outputs.
There is now a demo page for developers where they can try out the models. It can be found at OpenAI.fm. The Agents SDK can also be used to turn a text-based agent into a voice agent.
(emw)