Open-source music generator "YuE" creates songs offline from song lyrics

The open-source model "YuE" from the Chinese-American research collective M-A-P generates minutes-long songs in various styles and languages on a local PC.

After the Chinese AI language models DeepSeek R1 and Alibaba's Qwen 2.5 Max and the image generator DeepSeek Janus Pro caused a stir in quick succession, an AI music generator under an open-source license (Apache 2.0) is now following suit. It comes from the Chinese-American research collective Multimodal Art Projection (M-A-P), which has published a series of AI models for music generation in collaboration with the Hong Kong University of Science and Technology (HKUST). The project, "Open Music Foundation Models for Full-Song Generation", bears the ambiguous name "YuE" (乐), a character that means both "music" and "happiness" in Chinese.

From given lyrics, YuE generates a complete song several minutes long, comprising both a vocal part and an accompaniment. The models handle different genres, languages and vocal techniques, and the sample songs sound surprisingly coherent even after several minutes. However, all sample songs are currently mono only, whereas the well-known AI music services Udio and Suno produce stereo output.

Unlike those services, YuE runs offline on local hardware, although the requirements are not insignificant: according to the developers, generating a 30-second audio clip takes around 150 seconds on an Nvidia H800 GPU and around 360 seconds on a GeForce RTX 4090, i.e. roughly 5 and 12 times longer than real time, respectively.

For the full version that generates entire songs, the developers recommend at least 80 GB of GPU memory, which currently only a few high-end accelerators such as Nvidia's H800 (Hopper) or A100 offer, or several RTX 4090s combined. For shorter excerpts such as a verse plus a chorus, 24 GB of VRAM should suffice. If you have a suitably powerful graphics card, you can try YuE out yourself; an installation guide on YouTube helps with the setup, and the short check below gives a rough idea of which mode your hardware can handle.
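As a quick, unofficial pre-flight check (our sketch, not part of the YuE tooling), you can total up your GPUs' memory with PyTorch and compare it against the thresholds above:

```python
# Unofficial pre-flight check: totals local GPU memory and compares it
# against the VRAM thresholds quoted by the YuE developers. Our sketch,
# not part of the YuE tooling.
import torch

FULL_SONG_GIB = 80  # recommended for generating entire songs
EXCERPT_GIB = 24    # enough for a verse plus a chorus

if not torch.cuda.is_available():
    print("No CUDA GPU detected: YuE needs an Nvidia GPU to run locally.")
else:
    total_gib = sum(
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ) / 1024**3  # bytes -> GiB
    print(f"Total GPU memory: {total_gib:.1f} GiB")
    if total_gib >= FULL_SONG_GIB:
        print("Enough VRAM for full-song generation.")
    elif total_gib >= EXCERPT_GIB:
        print("Enough VRAM for shorter excerpts (verse + chorus).")
    else:
        print("Below the 24 GB the developers cite as the minimum.")
```

Summing the memory across all detected devices mirrors the developers' suggestion of combining several RTX 4090s.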

The YuE models use Meta's LLaMA architecture and were trained in three stages to ensure scalability, musicality and controllability via the lyrics. A semantically enhanced audio tokenizer was used to reduce training costs. M-A-P has released variants with 1 and 7 billion parameters for English, Chinese (Mandarin and Cantonese), Japanese and Korean, as well as an upsampler model. The latter allows the generated music to be output in CD quality at 44.1 kHz.
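Putting the pieces together, generation can be thought of as a token pipeline: the language model turns lyrics into discrete audio tokens, and the upsampler renders those tokens as a 44.1 kHz waveform. The sketch below only illustrates this flow; all class and function names are stand-ins, not YuE's real API:

```python
# Conceptual sketch of YuE's staged pipeline: a LLaMA-style LM emits
# discrete audio tokens from the lyrics, and an upsampler model renders
# them at 44.1 kHz. All classes are stand-in stubs for illustration,
# not the project's real API.
from typing import List

class Stage1LM:
    """Stand-in for the LLaMA-style LM mapping lyrics to audio tokens."""
    def generate(self, prompt: str) -> List[int]:
        # Real model: autoregressive decoding over the audio tokenizer's
        # discrete codebook, covering vocals and accompaniment.
        return [hash(word) % 1024 for word in prompt.split()]  # dummy IDs

class Upsampler:
    """Stand-in for the upsampler producing CD-quality output."""
    def render(self, tokens: List[int], sample_rate: int = 44_100) -> bytes:
        return bytes(len(tokens))  # dummy audio buffer

lyrics = "[verse] city lights are fading out ..."
tokens = Stage1LM().generate(f"[genre] pop, female vocal\n{lyrics}")
audio = Upsampler().render(tokens)
print(f"{len(tokens)} audio tokens -> {len(audio)}-byte 44.1 kHz buffer")
```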

Numerous demo songs can be found on the project page.

The models are freely available for download on GitHub and may also be used for commercial projects, provided you state that the songs were generated with AI support from M-A-P. Musicians and creatives are expressly encouraged to reuse and monetize works produced with YuE.

A few days ago, the developers added "in-context learning" to their models, allowing YuE to adopt the style of a reference song. As an example, they had an AI imitation of Billie Eilish sing a song about OpenAI.
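The announcement does not spell out the mechanism, but in token-based audio models such style transfer typically works by prepending tokens from the reference clip to the prompt, analogous to few-shot prompting in text LLMs. The sketch below illustrates that prompt layout with stand-in functions; it is not YuE's actual interface:

```python
# Style transfer via in-context learning, sketched with stand-in names
# (not YuE's real interface): tokens from a reference clip are prepended
# to the prompt so the model continues in that clip's style.
from typing import List

def tokenize_audio(reference_wav: bytes) -> List[int]:
    """Stand-in for encoding a reference clip into audio tokens."""
    return [b % 1024 for b in reference_wav[:64]]  # dummy token IDs

def generate_in_style(reference_wav: bytes, lyrics: str) -> List[int]:
    style_tokens = tokenize_audio(reference_wav)
    lyric_tokens = [hash(w) % 1024 for w in lyrics.split()]
    # The model conditions on the reference tokens before the lyrics;
    # here we merely concatenate to show the prompt layout.
    return style_tokens + lyric_tokens

print(len(generate_in_style(b"\x00" * 64, "[verse] fading city lights ...")))
```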

BPM control and a user-friendly interface are planned for the future. By porting the models to GGML, the tensor library for machine learning behind llama.cpp, the M-A-P team also hopes to reduce memory requirements.
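GGML ports usually store weights at reduced precision, which is where most of the memory savings come from. As a back-of-the-envelope illustration of that arithmetic (our example, not a figure from the YuE team), here is what the weights alone of a 7-billion-parameter model occupy at different precisions:

```python
# Back-of-the-envelope VRAM estimate for model weights at different
# precisions; an illustration of why quantized GGML builds save memory,
# not a statement about YuE's actual footprint (activations and the
# KV cache add further overhead on top of the weights).
PARAMS = 7e9  # the larger YuE variant has 7 billion parameters

for bits in (16, 8, 4):
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")

# Output:
# 16-bit weights: 13.0 GiB
#  8-bit weights: 6.5 GiB
#  4-bit weights: 3.3 GiB
```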

By releasing the models as open source, the developers hope AI music generation will see a breakthrough similar to what the image generator Stable Diffusion and Meta's language model LLaMA achieved in their respective fields. To optimize the models and extend them to more languages, the team behind YuE is looking for support, including partners to create and curate training data for fine-tuning and to evaluate the results.

The researchers plan to publish a scientific paper on YuE in the near future. So far, the project page only contains an abstract and an overview graphic in addition to the audio examples.

(vza)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.