LLM acceleration: Apple cooperates with Nvidia

The ReDrafter software is designed to significantly speed up the execution of large language models on Nvidia GPUs. The tool is open source.

Apple Intelligence logo and icon: Apple has some catching up to do when it comes to AI. (Image: Apple)

Apple has launched a project in collaboration with Nvidia to accelerate inference for large language models (LLMs), which, among other things, model the relationships between tokens. During inference, AI accelerators execute models that have already been trained.

In November, the company published open-source software called Recurrent Drafter, or ReDrafter for short, in a paper with accompanying code on GitHub. Nvidia itself is already using ReDrafter in its in-house TensorRT-LLM framework, as the AI giant announced in a blog post. It is a "novel speculative decoding technique" that helps developers "significantly accelerate" workload performance on Nvidia GPUs.

According to Apple, ReDrafter combined with TensorRT-LLM achieved a 2.7x speedup in tokens generated per second with greedy decoding. Apple says this was tested on a production model with several tens of billions of parameters. "The benchmark results show that this technology could significantly reduce the latency perceived by users." At the same time, it reduces compute usage and power consumption.


According to Nvidia, speculative decoding is a technique that accelerates LLM inference by generating multiple tokens in parallel. "Smaller 'draft' modules are used to predict future tokens, which are then verified by the main model." With this method, the output quality is as good as before, "while response times are significantly reduced, especially at low traffic". This makes better use of the available resources.
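The draft-then-verify loop described above can be illustrated with a minimal sketch. This is not Apple's ReDrafter (which uses a recurrent draft head) nor Nvidia's TensorRT-LLM implementation; both "models" here are hypothetical stand-in functions, and a real system would verify all drafted tokens in a single batched forward pass of the main model:

```python
# Toy sketch of greedy speculative decoding (illustrative only).
# draft_model and target_model are hypothetical stand-ins for a small
# draft network and the large main LLM.

def draft_model(seq):
    # Cheap draft model: proposes next token = (last + 1) % 10.
    return (seq[-1] + 1) % 10

def target_model(seq):
    # Main model: agrees with the draft except right after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

def speculative_step(seq, k=4):
    """One decoding step: draft k tokens, then keep the longest prefix
    the target model accepts, plus one token from the target itself."""
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    drafted, ctx = [], list(seq)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2) Verify phase: the target checks each drafted token. (A real
    #    implementation scores all k positions in ONE forward pass; the
    #    loop here just simulates that batched check.)
    accepted, ctx = [], list(seq)
    for t in drafted:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            break
    else:
        # All k drafts accepted: the target contributes a bonus token.
        accepted.append(target_model(ctx))
    return seq + accepted

print(speculative_step([1, 2], k=4))  # → [1, 2, 3, 4, 0]
```

Each step thus emits between one and k+1 tokens for a single (batched) pass of the main model, which is where the latency savings come from; output quality is unchanged because every emitted token is one the main model itself would have produced.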

Apple emphasizes that, in parallel to its server-side work with Nvidia GPUs, it is also working on accelerating LLM inference on Apple Silicon devices. Like its competitors Meta and OpenAI, the iPhone maker apparently relies heavily on Nvidia technology when training its own LLMs, so the rest of the industry should also benefit from the AI team's work. With open-source models, ReDrafter is said to have generated up to 3.5 tokens per generation step, surpassing the performance of earlier speculative decoding methods.

The latest version of the TensorRT-LLM framework contains both the necessary drafting and validation logic in a single engine, Nvidia writes, which minimizes overhead. The collaboration with Apple has made TensorRT-LLM "more powerful and flexible".


(bsc)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.