FastVLM: Apple's new image-to-text AI is said to be significantly faster
A new research paper offers a first glimpse of future Apple Intelligence models. The focus remains on local, on-device processing.
(Image: Apple)
When Apple Intelligence began to take shape in spring 2024, research papers by Apple engineers were the first harbingers. In retrospect, they already made clear where Apple was heading with its own AI models, including the ability to run some of them locally on iPhones and other devices. If the pattern repeats this year, the iPhone maker's AI features can expect a much better model that recognizes and processes images locally on the device. A recently published research paper on FastVLM (Fast Vision Language Models) provides the first details.
According to the paper, FastVLM's defining trait is speed, as Apple also reports on its Machine Learning Research blog. The FastVLM-0.5B variant is 85 times faster than LLaVA-OneVision, while the 7B variant is 7.9 times faster than Cambrian-1-8B at comparable accuracy. The model is also small enough to run locally on Apple devices, so users remain independent of the cloud and the model can meet high privacy standards. It thus fits well with Apple Intelligence's existing priorities.
New encoder for high-resolution images
(Image: Apple)
The basis for the faster image processing is the new FastViTHD vision encoder, which handles high-resolution images more efficiently than other models: the input no longer has to be downscaled beforehand, yet the encoder emits significantly fewer visual tokens. The model also required less training data.
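To see why fewer visual tokens translate into speed, consider a rough back-of-envelope sketch: in a patch-based encoder, the token count grows quadratically with resolution, and the language model has to process every visual token before it can produce its first word. The resolution and stride values below are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope sketch: why fewer visual tokens speed up a
# vision-language model. The resolution and strides are illustrative
# assumptions, not figures from the FastVLM paper.

def visual_tokens(image_px: int, output_stride: int) -> int:
    """Tokens a patch-based encoder emits for a square image, given
    its effective output stride (patch size times any extra
    downsampling stages)."""
    side = image_px // output_stride
    return side * side

HIGH_RES = 1024  # hypothetical high-resolution input, no prior downscaling

# A plain ViT-style encoder at output stride 16:
plain = visual_tokens(HIGH_RES, output_stride=16)   # 4096 tokens

# A hybrid encoder with additional downsampling stages (stride 64),
# in the spirit of FastViTHD:
hybrid = visual_tokens(HIGH_RES, output_stride=64)  # 256 tokens

# The language model must prefill every visual token before emitting
# its first output token, so 16x fewer tokens directly shortens the
# wait for the first word of an image description.
print(f"plain: {plain} tokens, hybrid: {hybrid} tokens "
      f"({plain // hybrid}x fewer)")
```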
For users, the fast processing means that the model generates text descriptions of images much more quickly, eliminating the waiting times of earlier models. Possible applications include document analysis (OCR), accessibility features, and visual search in photo libraries.
Examples of use
In three examples, the Apple researchers show what the model can do and how quickly it works. In one test case, it counts the fingers a hand holds up in a video. In another, a notepad is flipped through quickly while the model recognizes the handwritten notes in real time. In the third, the AI describes an emoji it is shown.
Apple already uses image recognition in various places across its operating systems and apps, including Visual Intelligence for object recognition and visual search in the Photos app. With the new model, these functions should become faster and more accurate. Other applications are also conceivable, such as additional image descriptions in the Mail app or an assistant in the Camera app.
Developer conference in June
Whether the new model makes it into iOS 19 should become clear on June 9, when Apple presents the next versions of its operating systems in the keynote that opens the WWDC developer conference.
(mki)