MusicLM, a model developed by Google researchers, can produce high-fidelity music from text descriptions like “a calming violin melody backed by a distorted guitar riff.” It generates audio at a constant 24 kHz sampling rate and remains consistent over several minutes.
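Under the hood, the paper casts generation as hierarchical sequence-to-sequence modelling: a MuLan text embedding conditions a first stage that predicts w2v-BERT semantic tokens (capturing melody and long-term structure), which in turn condition a second stage that predicts SoundStream acoustic tokens, finally decoded to a waveform. The sketch below illustrates that flow only; every function is a toy stand-in for the real pretrained component, and the token rates and dimensions in the comments are illustrative rather than exact.

```python
# Toy sketch of MusicLM's hierarchical pipeline. All functions are stand-ins
# for the real pretrained models (MuLan, w2v-BERT, SoundStream); the shapes
# and rates are illustrative assumptions, not the paper's exact configuration.
import numpy as np

rng = np.random.default_rng(0)

def mulan_text_embed(caption: str) -> np.ndarray:
    """Stand-in for MuLan's text tower: caption -> joint music-text embedding."""
    local = np.random.default_rng(abs(hash(caption)) % 2**32)
    return local.standard_normal(128)

def semantic_stage(text_emb: np.ndarray, n_steps: int) -> np.ndarray:
    """Stand-in for stage 1: MuLan conditioning -> semantic tokens
    (w2v-BERT tokens, which capture melody and long-term structure)."""
    return rng.integers(0, 1024, size=n_steps)

def acoustic_stage(text_emb: np.ndarray, semantic: np.ndarray) -> np.ndarray:
    """Stand-in for stage 2: semantic + MuLan conditioning -> acoustic tokens
    (SoundStream codes, which capture fine-grained audio detail)."""
    return rng.integers(0, 1024, size=(semantic.size * 2, 4))  # 4 RVQ levels

def soundstream_decode(acoustic: np.ndarray) -> np.ndarray:
    """Stand-in for the SoundStream decoder: acoustic tokens -> 24 kHz waveform."""
    return rng.standard_normal(acoustic.shape[0] * 480).astype(np.float32)

caption = "a calming violin melody backed by a distorted guitar riff"
text_emb = mulan_text_embed(caption)
semantic = semantic_stage(text_emb, n_steps=250)   # ~10 s at 25 tokens/s
acoustic = acoustic_stage(text_emb, semantic)
waveform = soundstream_decode(acoustic)            # 240,000 samples = 10 s at 24 kHz
print(waveform.shape)
```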
Experiments show that MusicLM outperforms previous systems in both audio quality and adherence to the text description. The researchers further demonstrate that MusicLM can be conditioned on both text and melody: it can transform whistled or hummed melodies to match the style described in a text caption. Finally, to aid future research, they release MusicCaps, a dataset of 5,500 music-text pairs with in-depth text descriptions written by human experts.
MusicLM
MusicLM is a text-conditioned generative model that reliably produces high-quality music at 24 kHz over several minutes while staying faithful to the conditioning text. The researchers also show that their approach outperforms baselines on MusicCaps, a manually curated, high-quality dataset of 5,500 music-text pairs.
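For readers who want to inspect the captions, a minimal loading snippet follows. MusicCaps is officially distributed via Kaggle; the Hugging Face Hub id `google/MusicCaps` and the field names shown are assumptions based on that release, so verify them before relying on this.

```python
# Hypothetical snippet for browsing MusicCaps, assuming it is mirrored on the
# Hugging Face Hub as "google/MusicCaps" (an assumption; the official release
# is on Kaggle) and that the column names match the Kaggle CSV.
from datasets import load_dataset

musiccaps = load_dataset("google/MusicCaps", split="train")
print(len(musiccaps))  # ~5,500 examples

example = musiccaps[0]
print(example["ytid"])         # YouTube id of the 10 s source clip
print(example["caption"])      # free-text description written by a musician
print(example["aspect_list"])  # aspect annotations, e.g. genre, mood, tempo
```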
Some of the method’s limitations are inherited from MuLan: the model can misunderstand negations and does not always respect the temporal ordering described in the text. The quantitative evaluations also leave room for improvement; in particular, the MuLan Cycle Consistency (MCC) scores favour their approach because the metric itself relies on MuLan. Future work could address lyric generation, improved text conditioning, and better vocal quality. Modelling complex song structures, such as the introduction, verse, and chorus, is another open direction, as is generating audio at a higher sampling rate.
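To make the MCC caveat concrete: the metric embeds the caption with MuLan’s text tower and the generated clip with its audio tower, then scores their cosine similarity. Since MusicLM is conditioned on those same MuLan embeddings, it is optimising the quantity the metric measures. The sketch below shows the computation; the two embedding functions are toy stand-ins for the real MuLan towers.

```python
# Sketch of the MuLan Cycle Consistency (MCC) idea: cosine similarity between
# the MuLan text embedding of the caption and the MuLan audio embedding of the
# generated clip. Both embed functions are toy stand-ins for the real model.
import numpy as np

def mulan_text_embed(caption: str) -> np.ndarray:
    local = np.random.default_rng(abs(hash(caption)) % 2**32)
    return local.standard_normal(128)

def mulan_audio_embed(waveform: np.ndarray) -> np.ndarray:
    local = np.random.default_rng(int(np.abs(waveform).sum() * 1e3) % 2**32)
    return local.standard_normal(128)

def mcc(caption: str, waveform: np.ndarray) -> float:
    t = mulan_text_embed(caption)
    a = mulan_audio_embed(waveform)
    return float(t @ a / (np.linalg.norm(t) * np.linalg.norm(a)))

score = mcc("a calming violin melody", np.zeros(240_000, dtype=np.float32))
print(f"MCC: {score:.3f}")
```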
Conclusion
With its ability to create professional-sounding music from a written description, MusicLM joins the group of technologies that aid people in their creative endeavours. However, the model and the use case it targets carry a number of risks. Generated samples will reflect the biases present in the training data, raising questions about cultural appropriation and about whether it is appropriate to generate music in the styles of cultures that are underrepresented in that data.