JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

2024 Vol.25 No.7 P.1025-1030

Suno: potential, prospects, and trends

Jiaxing YU, Songruoyao WU, Guanting LU, Zijin LI, Li ZHOU, Kejun ZHANG

College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China; Department of Music Artificial Intelligence and Music Information Technology, Central Conservatory of Music, Beijing 100031, China; School of Arts and Communication, China University of Geosciences (Wuhan), Wuhan 430074, China; Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China

yujx@zju.edu.cn, wsry@zju.edu.cn, 3210105631@zju.edu.cn, lzijin@ccom.edu.cn, zhouli@cug.edu.cn, zhangkejun@zju.edu.cn

Abstract: Suno has attracted wide attention due to its impressive capabilities. It demonstrates technological advancements and opens up new possibilities for music composition, representing a milestone in the development of artificial intelligence (AI) music generation. In this paper, we first introduce the background and summarize the general technical framework of AI music generation, followed by an analysis of Suno’s advantages and disadvantages. Finally, we discuss the future trends in Music and AI.

Key words: Music artificial intelligence; Music generation; AI music generation platform; Suno

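Chinese Summary: Suno: potential, prospects, and trends

Jiaxing YU¹, Songruoyao WU¹, Guanting LU¹, Zijin LI², Li ZHOU³, Kejun ZHANG¹,⁴
¹College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
²Department of Music Artificial Intelligence and Music Information Technology, Central Conservatory of Music, Beijing 100031, China
³School of Arts and Communication, China University of Geosciences (Wuhan), Wuhan 430074, China
⁴Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China
Abstract: Suno has attracted wide attention for its outstanding music generation capabilities. It not only demonstrates advances in music AI technology but also opens up new possibilities for music creation, marking a milestone in the development of AI music generation. This paper introduces the technical background of AI music generation, summarizes its general technical framework, analyzes Suno’s advantages and limitations, and discusses future trends in music AI.

Key words: Music artificial intelligence; Music generation; AI music generation platform; Suno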




DOI: 10.1631/FITEE.2400299


On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2024-05-24

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE
