Artificial Intelligence: On the Way to the First Generated Hit

In the future, AIs are set to compose songs from text input. Riffusion and MusicLM show the current state of research, but the music still lacks a certain spark.


(Image: AI image generator Midjourney)



AI programs have so far only delivered ready-to-use background music, but that may change: In their laboratories, AI researchers are working on clever algorithms with a larger repertoire and a better understanding of music. Eventually, the AI may compose entire operas or write the next big hit.

For now, the developers are chiefly concerned with getting an AI to improvise a new piece in response to short text instructions and play it back as an audio file. At this stage, the sound quality of the demos is not a priority for them.

Two popular approaches are Riffusion and MusicLM, which Google debuted in late January. Riffusion is a hobby project by two developers, Seth Forsgren and Hayk Martiros. It is based on the well-known image AI Stable Diffusion, which converts text descriptions (so-called prompts, for example "astronaut on a horse") into complex images.


For this purpose, Stable Diffusion generates coherent motifs from Gaussian noise. For training, the developers successively added noise to an original image and had the deep-learning algorithm restore it to its original state. They repeated this with countless labeled images from the internet. Finally, they coupled the diffusion model with a text-encoding language model, so that the AI can generate almost any motif from noise when given a text instruction.
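
A minimal sketch in Python of the forward noising step at the heart of this training scheme; the variance schedule, tensor shapes, and the model call in the comment are illustrative assumptions, not Stable Diffusion's actual settings:

    import torch

    def add_noise(x0, t, num_steps=1000):
        # Forward diffusion: blend a clean image x0 with Gaussian noise.
        # t is the diffusion step (0 = nearly clean, num_steps - 1 = almost pure noise).
        betas = torch.linspace(1e-4, 0.02, num_steps)       # simple linear noise schedule (assumed)
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # remaining signal level per step
        signal = alphas_cumprod[t].sqrt()
        noise = torch.randn_like(x0)
        x_t = signal * x0 + (1.0 - alphas_cumprod[t]).sqrt() * noise
        return x_t, noise

    # Training objective (schematic): the network sees x_t and t and must
    # predict the added noise, e.g. loss = ((model(x_t, t) - noise) ** 2).mean()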

Riffusion uses this principle to compute spectrograms for music from noise. These images plot time on the x-axis and the frequency distribution on the y-axis: higher tones at the top, lower tones at the bottom. Colors from blue to red, or shades of gray from white to black, represent the respective loudness. During playback, the software converts the generated spectrograms back into audio.
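
The round trip from audio to spectrogram image and back can be sketched with standard tooling; Riffusion's actual pipeline differs in detail, and the Griffin-Lim phase estimation below is only a stand-in for whatever reconstruction the project uses (file name and analysis parameters are assumptions):

    import numpy as np
    import librosa

    # Analysis: audio -> magnitude spectrogram (time on x, frequency on y)
    y, sr = librosa.load("input.wav", sr=22050)  # placeholder file
    S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

    # A generative model would now synthesize or edit S as if it were an image.

    # Synthesis: spectrogram -> audio. The phase information is lost, so it
    # is estimated iteratively with the Griffin-Lim algorithm.
    y_hat = librosa.griffinlim(S, hop_length=512)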

To do this, the two developers trained Riffusion on the spectrograms of various free music collections, together with the associated text descriptions. This is how the software learned what the spectrogram of a mellow jazz number with piano and double bass looks like, and how it differs from that of a heavy-metal guitar solo.

Riffusion generates an endless mix of music whose sound you can steer with text instructions.

As a result, the program on the riffusion.com website delivers an endless mix of music that slowly changes in response to English text instructions, as if a DJ were switching to a new style. Admittedly, the transitions are still bumpy here and there, and the vocals consist only of unintelligible sounds. But the AI clearly has a sense of how a disco beat differs from a piano solo.

However, the sound quality is poor: since the denoised spectrograms consist of only 1024 × 1024 pixels, the generated tracks sound as if they were encoded at too low a bit rate. Even when Riffusion strings many such recalculated spectrograms together, the AI can only divide the frequency spectrum into 1024 bands.
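
How coarse that is can be estimated with a quick back-of-the-envelope calculation; the sample rate below is an assumption, as the article does not state Riffusion's actual audio settings:

    sample_rate = 44100        # assumed CD-style sample rate
    nyquist = sample_rate / 2  # highest representable frequency: 22050 Hz
    bands = 1024               # vertical resolution of the spectrogram image

    print(f"{nyquist / bands:.1f} Hz per band")  # ~21.5 Hz per band

At roughly 21 Hz per band, low notes that lie only a few hertz apart land in the same band, which fits the muffled, over-compressed impression.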

Google unveiled its MusicLM at the end of January; it is designed to generate music from text input or a pre-sung melody. Alongside the model, the developers published a dataset of 5,500 music-text pairs (MusicCaps), which they make available to other researchers: the music references consist of YouTube links that have been annotated by experts.

Similar to Riffusion, MusicLM recombines the audio material it has learned according to the user's text prompt. The musical variety here is remarkable. However, we also missed thematic ideas in the demos released so far; the tracks just ripple along for minutes. The songs are encoded with the SoundStream codec at 24 kHz and a bit rate of 6 kbit/s, so they sound like a telephone transmission with compression artifacts.
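
To put the 6 kbit/s figure into perspective, compare it with uncompressed PCM at the same sampling rate; 16-bit mono is an assumption made purely for the comparison:

    sample_rate = 24_000         # SoundStream output sample rate
    pcm_bits = sample_rate * 16  # 16-bit mono PCM: 384,000 bit/s
    codec_bits = 6_000           # bit rate of the MusicLM demos

    print(f"compression factor: {pcm_bits / codec_bits:.0f}x")  # 64x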

There is still a lot of work ahead for the researchers before these interesting AI approaches develop into serious commercial services that support or even inspire music creators in their daily work: the AIs must learn to write and vary catchy melodies, incorporate song structures and dynamic development, and, not least, deliver significantly better sound quality. Researchers at Baidu in China are pursuing similar solutions with their ERNIE-Music system.

(hag)