Echovo brings AI speech synthesis entirely to the iPhone
A free iPhone app offers text-to-speech and voice cloning. It demonstrates what local, on-device AI can do.
Screenshots of the Echovo app: The user interface (left) with the choice between TTS and Voice Cloning, and the display of statistics (right) after generating clips.
(Image: heise medien)
A new free iPhone app that reads text aloud and clones voices shows what is currently technically possible when running artificial intelligence locally on a device. Echovo, by Harim Kang, uses the Qwen3-TTS model from Chinese provider Alibaba Cloud for this purpose. According to the developer, all processing takes place on-device. The results are impressive.
Qwen3-TTS is an open-source model released in January 2026 and trained on five million hours of speech data. Unlike comparable text-to-speech (TTS) models from ElevenLabs or OpenAI, it has been specifically optimized for local inference. The Echovo app supports eleven languages. However, the model struggles with accents and dialects, which gives cloned voices away as AI-generated. Even so, it captures speech melody and individual quirks quite well.
Two Models to Choose From
The developer built the iPhone app with Apple's MLX framework, a machine learning framework accelerated via Metal, which uses GPU and Neural Engine acceleration. The framework makes more efficient use of available RAM and allows the AI model to be loaded entirely into shared memory.
After installing the app, two models, each 1.9 GB in size, are available for download. The Base model is sufficient for text-to-speech with a standard voice as well as for cloning voices. With the CustomVoice model, different voices can be selected for TTS.
No Cloud Costs
Depending on the device, generation is sometimes faster than real time, for example when we tried the iPhone app on a Mac with an M4 Pro. The app displays live metrics: the Real-Time Factor (RTF), the actual processing time, RAM consumption, input length, and chip heat. Because generation happens on-device, there are no usage costs, unlike with cloud services. For voice cloning, a three-second clip is sufficient. On an iPhone 17 Pro Max, a cloned clip was generated with an RTF of 4.074.
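The Real-Time Factor relates processing time to the length of the generated audio. The article does not say which convention Echovo uses, so the following Python sketch shows both common ones: the conventional RTF (processing time divided by audio duration, where values below 1 mean faster than real time) and its inverse, often reported as a speed factor. The function names and sample numbers are illustrative assumptions, not values taken from the app.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: processing time divided by generated audio duration.

    Values below 1.0 mean the audio was generated faster than it plays back.
    """
    if processing_seconds < 0:
        raise ValueError("processing_seconds must not be negative")
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speed_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse convention: seconds of audio produced per second of processing.

    Values above 1.0 mean the audio was generated faster than it plays back.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Hypothetical measurement: a 10-second clip generated in 2.5 seconds.
    processing, audio = 2.5, 10.0
    print(f"RTF (processing / audio): {real_time_factor(processing, audio):.3f}")
    print(f"Speed factor (audio / processing): {speed_factor(processing, audio):.3f}")
```

Which of the two readings applies to a displayed value depends on how the tool defines its metric, so a figure above 1 can mean either slower or faster than real time.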
For best performance, a device with an A17 Pro or newer chip is recommended. Additionally, storage space is required for the downloaded models.
(mki)