Speech-to-text: Apple's new APIs outperform Whisper on speed
Apple has significantly improved its speech-to-text APIs in iOS 26 and macOS Tahoe. Tests show that the higher speed comes at the expense of accuracy.
Apple is significantly improving the transcription of live audio and recordings in its upcoming operating system versions, and the new speech recognition has now been benchmarked against other common models in several tests. The results are mixed: Apple's new API, available in iOS 26, iPadOS 26, and macOS 26 Tahoe, is considerably faster than OpenAI's widely used Whisper model, but its accuracy still leaves room for improvement.
The Apple news blog MacStories tested the improved Speech framework with a 34-minute video file, using Yap, a command-line tool available on GitHub that calls Apple's transcription APIs. Yap completed the task in just 45 seconds, while the popular MacWhisper tool took between 1:41 and 3:55 minutes, depending on which of its large models was used.
How the models compare
The news site 9to5Mac pitted Apple's API against NVIDIA's Parakeet, which is considered very fast, and against OpenAI's Whisper Large V3 Turbo. The test machine was a MacBook Pro with an M2 Pro chip and 16 GB of unified memory. Parakeet transcribed a 7:31-minute audio file in 2 seconds, Apple's transcription took 9 seconds, and the OpenAI model needed 40 seconds. The longer the audio file, the wider the gap between the models grew.
Whisper's slower pace, however, paid off in accuracy. The tests distinguished between the character error rate (CER) and the word error rate (WER). On average, Whisper Large V3 Turbo proved to be the most accurate solution, with a CER of 0.3 percent and a WER of 1 percent. Apple's API averaged an error rate of 3 percent for characters and 8 percent for words, while Parakeet lagged far behind with a CER of 7 percent and a WER of 12 percent.
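Both metrics are standard Levenshtein-based measures: the minimum number of insertions, deletions, and substitutions needed to turn the model's output into the reference transcript, divided by the reference length, counted either in characters (CER) or in words (WER). A minimal sketch of how such scores are computed (an illustration, not the testers' actual scoring script):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution (free on match)
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

# One wrong word out of four gives a WER of 25 percent:
print(wer("the quick brown fox", "the quick brown box"))  # → 0.25
```

Because a single misrecognized word usually differs from the reference by only a character or two, the CER of a transcript is typically much lower than its WER, which matches the pattern in the results above.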
What Apple's API is recommended for
In sum, Apple's transcription offers a clear speed advantage over Whisper without making as many errors as NVIDIA's model. The testers concluded that the choice of model is therefore primarily a question of use case: Apple's API is well suited to time-critical applications such as live captions, or to roughly transcribing longer recordings for indexing. Whisper has the edge wherever accuracy matters and only minimal post-processing is acceptable.
(mki)