MusicLM — Google's AI music generator — is like ChatGPT for audio

It can write 5-minute songs based on short text prompts.

April 23, 2023

Credit: Adobe Stock, Annelisa Leinbach

Over the past several months, increasingly sophisticated generative AI has shown a wide range of capabilities, from solving math problems to writing poetry to creating art.
MusicLM is an AI music generator unveiled by Google in January 2023.
The audio it produces sounds similar to human-written music, but MusicLM still cannot replicate traditional song structures, and the vocals it creates are of particularly poor quality, with unintelligible lyrics.

Google has unveiled an advanced AI music generator that can turn a snippet of text into a song — but legal concerns might prevent the tech giant from ever sharing it with the public.

The AI revolution: ChatGPT, DALL-E 2, and other advanced AIs capable of generating impressive text or images in response to user prompts exploded in popularity in 2022, but they weren’t the first generative AIs, nor the only examples of what neural networks can do.

Several companies have also trained AIs to generate music in response to text, audio, or image prompts — OpenAI, the research firm behind ChatGPT and DALL-E 2, even released an AI music generator called “Jukebox” back in 2020.

These systems haven’t been as enthusiastically embraced as their text- and image-generating counterparts, though, mainly because their outputs aren’t as impressive — most are low-fidelity, simplistic, and lacking in traditional song structures, such as repeating choruses.

Introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We're releasing a tool for everyone to explore the generated samples, as well as the model and code: https://t.co/EUq7hNZv62 pic.twitter.com/sh5yHz7qrc
— OpenAI (@OpenAI) April 30, 2020

What’s new? Music-making AIs are getting better, though, and perhaps the most impressive example of the technology is MusicLM, an AI music generator unveiled by Google in January 2023.

The system can generate clips up to 5 minutes long based on text descriptions, and while the music isn’t going to win any Grammys, the audio does sound more like something a human might record than the clips generated by other AIs.

How it works: Google trained MusicLM on more than 280,000 hours of music. To tie that audio to text, the system relies on MuLan, a model trained to link music to descriptions written in natural language.

The researchers then created MusicCaps, a publicly accessible dataset of more than 5,500 music clips, to evaluate the AI music generator. Expert musicians wrote captions for each of these clips, as well as lists of aspects to describe them, such as their genre or mood.

During the evaluation stage, Google pitted MusicLM against two other text-to-music AIs — Mubert and Riffusion — using several quantitative metrics for assessing a clip’s audio quality and adherence to a text description.

They also presented human evaluators with MusicCaps’ descriptions and two audio clips — these might be two clips produced by AIs or one AI-generated clip and the music upon which the MusicCaps description was based. The evaluators then chose which of the clips they thought best matched the description.
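
A head-to-head listening test like this one is usually scored by counting how often each system gets picked. The sketch below shows one simple way that tally might work. It is not Google's scoring code, and the system names and judgments are placeholders invented for illustration, not results from the paper.

```python
from collections import Counter


def tally_pairwise_wins(judgments):
    """Return the share of comparisons each system won.

    Each judgment is a tuple (system_a, system_b, winner), where winner is
    whichever of the two clips the evaluator felt best matched the description.
    """
    wins = Counter()
    appearances = Counter()
    for system_a, system_b, winner in judgments:
        if winner not in (system_a, system_b):
            raise ValueError(f"winner {winner!r} was not part of this comparison")
        appearances[system_a] += 1
        appearances[system_b] += 1
        wins[winner] += 1
    # Win rate = times picked / times the system appeared in a pair.
    return {name: wins[name] / appearances[name] for name in appearances}


# Hypothetical judgments (placeholder names, made-up outcomes).
example_judgments = [
    ("ai_one", "ai_two", "ai_one"),
    ("ai_one", "ai_three", "ai_one"),
    ("ai_two", "ai_three", "ai_three"),
    ("ai_one", "reference_recording", "reference_recording"),
]

win_rates = tally_pairwise_wins(example_judgments)
for name, rate in sorted(win_rates.items()):
    print(f"{name}: picked in {rate:.0%} of its comparisons")
```

A raw win rate like this ignores ties and how strongly an evaluator preferred one clip, so a real study may score its listening test differently.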

According to a paper Google shared on the preprint server arXiv, MusicLM outperformed the other AIs across the board.

Looking ahead: Google’s AI music generator may be able to produce audio that sounds closer to human-written music, but it still can’t replicate traditional song structures, and the vocals it creates are of particularly poor quality, with unintelligible lyrics.

Google says future work on the system could focus on those issues, improving the overall quality of the audio, and addressing the problem that’s preventing it from releasing MusicLM to the public: about 1% of its output can be approximately matched to audio in its training data.

“We acknowledge the risk of potential misappropriation of creative content associated to the use case … We strongly emphasize the need for more future work in tackling these risks associated to music generation,” the researchers wrote.

This article was originally published by our sister site, Freethink.
