Resource-saving: Efficient AI language models without matrix multiplication

Researchers are developing language models that get by without memory-intensive matrix multiplications and can still compete with modern transformers.

Researchers from the USA and China have presented a new approach to optimizing AI language models. The goal is for large language models (LLMs) to require significantly less memory and computing power than current LLMs while delivering results of comparable quality. These "MatMul-free language models" aim to achieve this by largely dispensing with resource-intensive matrix multiplications (MatMul). Matrix multiplications are the central computational operation of deep learning and, in particular, of the transformer architectures behind large language models such as GPT-3/4 or PaLM. They account for the majority of the resource requirements and are a hurdle for scaling the models. Put simply, MatMul-free language models replace as many of these complex calculations as possible with simpler additions.
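The principle can be illustrated with a small sketch (an illustrative example, not code from the paper): once the weights of a layer are restricted to -1, 0 and 1, a matrix-vector product no longer needs any multiplications, because each input value is simply added, subtracted or ignored.

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product with weights restricted to {-1, 0, +1}.

    Because every weight is -1, 0 or +1, no multiplications are needed:
    input entries are simply added, subtracted or skipped.
    """
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        for j, w in enumerate(row):
            if w == 1:
                out[i] += x[j]   # addition instead of multiplication
            elif w == -1:
                out[i] -= x[j]   # subtraction instead of multiplication
            # w == 0: the input entry is ignored
    return out

# Example: a 3x4 ternary weight matrix applied to a 4-dimensional input
W = np.array([[ 1,  0, -1,  1],
              [ 0,  1,  1,  0],
              [-1,  1,  0, -1]])
x = np.array([0.5, -2.0, 1.5, 3.0])
print(ternary_matvec(W, x))  # same result as W @ x, but computed without multiplications
```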

The architecture of MatMul-free models uses additive operations in dense layers (also known as fully connected layers), the basic building blocks of neural networks, as well as element-wise multiplications of vectors (Hadamard products). So-called ternary weights replace MatMul operations with simple additions and subtractions: while ordinary weights can take on practically any value, as they are usually stored as 16- or 32-bit floating-point numbers, ternary weights are limited to the values -1, 0 and 1, which significantly simplifies and speeds up the calculations. The research team also optimized a special network architecture, the "Gated Recurrent Unit" (GRU). The GRU stores and updates information over certain periods of time, making it something like the (short-term) memory of the neural network. Through targeted adaptations, the scientists modified the GRU so that it gets by with elementary operations such as additions and element-wise multiplications. Those who want to delve deeper into the mathematics and challenges of such models can find the publication "Scalable MatMul-free Language Modeling" (PDF) as a preprint on arXiv.org.
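What such an adapted recurrent unit might look like can be sketched as follows. This is a simplified, hypothetical illustration based on the description above, not the authors' implementation; the function names and the exact gate structure are assumptions. The dense projections use ternary weights, so they reduce to additions and subtractions, and the state update and gating only need element-wise (Hadamard) products.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ternary_matvec(W, x):
    # Weights in {-1, 0, +1}: the "product" reduces to sums and differences of inputs.
    return ((W == 1) * x).sum(axis=1) - ((W == -1) * x).sum(axis=1)

def matmul_free_gru_step(x_t, h_prev, Wf, Wc, Wg, Wo):
    """One GRU-style update step without matrix multiplications (illustrative).

    Wf, Wc, Wg, Wo are ternary weight matrices; all projections therefore
    reduce to additions/subtractions, and the state update and gating use
    only element-wise (Hadamard) products.
    """
    f_t = sigmoid(ternary_matvec(Wf, x_t))        # forget/update gate
    c_t = np.tanh(ternary_matvec(Wc, x_t))        # candidate state
    h_t = f_t * h_prev + (1.0 - f_t) * c_t        # Hadamard-product state mixing
    g_t = sigmoid(ternary_matvec(Wg, x_t))        # output gate
    o_t = ternary_matvec(Wo, g_t * h_t)           # gated, ternary output projection
    return h_t, o_t

# Tiny usage example with random ternary weights (input and hidden size 4)
rng = np.random.default_rng(0)
Wf, Wc, Wg, Wo = (rng.integers(-1, 2, size=(4, 4)) for _ in range(4))
h = np.zeros(4)
for x_t in rng.normal(size=(3, 4)):               # a short sequence of three tokens
    h, o = matmul_free_gru_step(x_t, h, Wf, Wc, Wg, Wo)
print(o)
```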

According to the team, these simplifications made it possible to optimize GPU-targeted models so that memory requirements during training dropped by up to 61 percent. During inference, i.e. when the models are used for prediction, memory consumption even fell by more than a factor of ten thanks to specialized kernels, according to the publication. The researchers also developed a hardware solution based on an Intel D5005 Stratix 10 Field Programmable Gate Array (FPGA) that ran a customized MatMul-free LLM with 2.7 billion parameters at a power consumption of just 13 watts; GPU-based systems of the same speed would need several hundred watts to run such models. According to the study's lead author, Jason Eshraghian, the MatMul-free models offer performance comparable to full-precision transformers while requiring significantly less memory. Eshraghian also emphasizes that the performance advantage of conventional transformers shrinks as model size increases.

Comparisons with modern transformer architectures show that the new model performs competitively on various benchmark datasets. So far, however, the comparisons have been limited to the Transformer++ baseline.

Eshraghian sees the work on MatMul-free models as an important contribution to the development of future hardware accelerators. The results could be particularly interesting for running LLMs on devices with limited resources, such as smartphones or embedded systems.

(vza)