Apple takes a different approach to video AI
Apple's AI division is releasing STARFlow-V, a new video AI model. It stands out for its realistic video generation using normalizing flows.
Some examples created with Apple's video AI model
(Image: Apple)
While the management of Apple's AI division is currently being restructured, researchers from the team have released a new video AI model that is causing a stir, at least within the specialist community. STARFlow-V departs from the well-trodden path of the widely used diffusion models. Instead, the researchers rely on so-called normalizing flows, a technique that has so far played hardly any role in video generation.
Anyone who looks at the generated examples on the project page on GitHub will quickly see what sets STARFlow-V apart from comparable AI models: it generates short videos far more realistically and much closer to what the prompt specifies. Where others show inexplicable flickering, look strikingly artificial, or exhibit typical AI artifacts such as distortions, Apple's model delivers solid quality. The videos have a resolution of only 480p, though; Apple apparently aims to demonstrate feasibility rather than deliver a model ready for everyday use.
What the model can do
The model, with its 7 billion parameters, can generate videos from text descriptions, extend still images into videos, and edit existing videos. The researchers trained STARFlow-V on 70 million text-video pairs and an additional 400 million text-image pairs. The model generates videos with 480p resolution at 16 frames per second and a length of up to 5 seconds per segment.
Longer videos are created by gradual extension: the end of a 5-second segment serves as the starting point for the next. On the project page, Apple shows examples up to 30 seconds long. This is precisely where the strength of the unusual architecture becomes apparent. Unlike diffusion models, normalizing flows are mathematically invertible mappings. The model can therefore compute the exact probability of a generated video, needs no separate encoder for input images, and can be trained end-to-end.
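The payoff of invertibility can be illustrated with a toy one-dimensional flow (a minimal sketch for illustration only; the affine map and all names are invented here, not Apple's architecture). Because the mapping from base noise to data can be inverted exactly, the exact likelihood of any sample follows from the change-of-variables formula:

```python
import numpy as np

# Toy 1-D normalizing flow: an invertible affine map x = a*z + b,
# where z follows a standard normal base distribution.
a, b = 2.0, 0.5

def forward(z):
    return a * z + b          # sampling: base noise -> data space

def inverse(x):
    return (x - b) / a        # exact inverse -- no separate encoder needed

def log_likelihood(x):
    # Change of variables: log p(x) = log p_z(f_inv(x)) + log|d f_inv/dx|
    z = inverse(x)
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))
    return log_pz + np.log(1.0 / abs(a))

x = forward(0.3)
assert abs(inverse(x) - 0.3) < 1e-12   # invertible: round trip is exact
```

Real flows stack many such invertible transformations with learned parameters, but the two properties shown here, exact inversion and exact likelihood, carry over unchanged.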
Calculated in chronological order
Another difference: STARFlow-V generates videos strictly autoregressively – meaning frame by frame in chronological order, so that later frames cannot influence earlier ones. Standard diffusion models, on the other hand, often denoise all frames in parallel.
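The strictly causal generation order can be sketched as a plain loop (an illustrative sketch; `next_frame` is a hypothetical stand-in for the neural network, not Apple's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_frame(history):
    # Hypothetical stand-in for the model: predicts the next frame
    # from earlier frames only. In STARFlow-V this would be the network.
    return history[-1] * 0.9 + rng.normal(scale=0.1, size=history[-1].shape)

def generate(first_frame, n_frames):
    frames = [first_frame]
    for _ in range(n_frames - 1):
        # Strictly causal: frame t is computed from frames 0..t-1 only,
        # so later frames can never influence earlier ones.
        frames.append(next_frame(frames))
    return frames

clip = generate(np.zeros((4, 4)), n_frames=16)
```

A parallel diffusion sampler, by contrast, would denoise all 16 frames jointly, letting information flow in both temporal directions.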
The researchers have also equipped the model with a "Global-Local Architecture": coarse temporal relationships over several seconds are processed in a compact global space, while fine details within individual frames are handled locally. This is intended to prevent small errors from accumulating over longer sequences and developing a life of their own.
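The division of labor might be pictured roughly like this (a speculative sketch; `global_module` and `local_module` are invented names, and the real architecture is far more involved than this downsample-and-refine toy):

```python
import numpy as np

def global_module(frames):
    # Coarse temporal context: work on heavily downsampled frames so
    # relationships across many seconds stay cheap to model.
    coarse = frames[:, ::4, ::4]            # 32x32 -> 8x8 per frame
    return coarse.mean(axis=0)              # compact summary of the clip

def local_module(frame, context):
    # Fine detail: refine one frame at full resolution, conditioned
    # on the global summary to keep it temporally consistent.
    upsampled = np.kron(context, np.ones((4, 4)))   # 8x8 -> 32x32
    return 0.9 * frame + 0.1 * upsampled

frames = np.random.default_rng(1).random((16, 32, 32))
ctx = global_module(frames)
refined = np.stack([local_module(f, ctx) for f in frames])
```

The point of the split is that per-frame errors are corrected against a shared coarse context instead of compounding frame after frame.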
For acceleration, STARFlow-V uses a "video-aware Jacobi iteration": instead of calculating each value one by one, multiple blocks are processed in parallel. The first frame of a new segment is developed from the last frame of the previous one. According to Apple, the system thus achieves significant acceleration compared to standard autoregression.
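The underlying idea of Jacobi-style decoding, stripped of the video-specific scheduling (this is a generic sketch, not Apple's implementation), is a fixed-point iteration: guess the whole block, then refresh every position in parallel until the result matches what step-by-step decoding would produce:

```python
import numpy as np

def step(prev):
    # Stand-in transition; in a real model this would be the network's
    # prediction of a value from the previous one.
    return 0.5 * prev + 1.0

def sequential_decode(x0, n):
    xs = [x0]
    for _ in range(n):
        xs.append(step(xs[-1]))          # one position at a time
    return np.array(xs)

def jacobi_decode(x0, n, iters=20):
    xs = np.zeros(n + 1)
    xs[0] = x0
    for _ in range(iters):
        xs[1:] = step(xs[:-1])           # all positions updated at once
    return xs

seq = sequential_decode(0.0, 8)
par = jacobi_decode(0.0, 8)
assert np.allclose(seq, par)             # same result, parallel updates
```

After at most n parallel sweeps the iteration is exact, and in practice it often converges much sooner, which is where the speedup over plain autoregression comes from.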
Octopus escapes from the glass
In benchmarks on VBench, STARFlow-V achieves scores comparable to current diffusion models, although it still lags significantly behind commercial systems such as Google's Veo 3 or Runway's Gen-3.
But things also go wrong with Apple's model: the octopus in the glass simply walks through the wall, and a hamster runs in its transparent wheel as if the laws of physics did not apply. And despite the optimizations, inference is still far from real-time.
What are Apple's plans?
What Apple itself intends to do with the model also remains unclear. Given its comparatively small size, it could conceivably run locally on devices; it could also serve as a world model for virtual or augmented reality. Finally, it might prove useful for Apple's rumored ambitions in robotics.
Interested parties can view the code on GitHub. A paper on the model is also available.
(mki)