Score Images as a Modality: Enhancing Symbolic Music Understanding through Large-Scale Multimodal Pre-Training
Abstract
1. Introduction
- Score Images as a Modality. We propose the SIM model, which integrates music score images alongside MIDI data in multimodal pre-training. To our knowledge, we are the first to utilize score images as a distinct modality, thereby enriching the model’s comprehension of symbolic music.
- Novel Pre-training Tasks. We introduce two pre-training tasks, masked bar-attribute modeling and score-MIDI matching, which enable the SIM model to capture musical structure and to align visual and symbolic representations; a minimal sketch of both objectives follows this list. Together, they strengthen the model's ability to interpret diverse musical content.
- Specialized Dataset Creation. We compile a comprehensive dataset of matched score-image and MIDI pairs, tailored to training the SIM model effectively and to demonstrating the value of score images as a modality in multimodal pre-training.
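The sketch below illustrates the two objectives in PyTorch, assuming an event-token encoding in which every note token carries a bar index and an attribute id; the module names, tensor shapes, and the fused [CLS] embedding are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two pre-training objectives (assumed encoding).
import torch
import torch.nn as nn

MASK_ID = 0  # assumed id of the [MASK] token

def mask_bar_attribute(tokens, bar_ids, attr_ids, bar, attr):
    """Masked bar-attribute modeling: hide every token of one attribute
    (e.g., pitch) inside one bar and train the model to recover them."""
    masked = tokens.clone()
    positions = (bar_ids == bar) & (attr_ids == attr)
    masked[positions] = MASK_ID
    return masked, positions  # positions mark the prediction targets

class ScoreMidiMatchingHead(nn.Module):
    """Score-MIDI matching: a binary classifier over the fused [CLS]
    embedding, predicting whether a score image and a MIDI sequence
    come from the same piece."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, cls_embedding):
        return self.classifier(cls_embedding)  # logits: (batch, 2)

# Toy usage: 16 tokens spanning 4 bars with 2 interleaved attribute streams.
tokens = torch.randint(1, 100, (16,))
bar_ids = torch.arange(16) // 4
attr_ids = torch.arange(16) % 2
masked, targets = mask_bar_attribute(tokens, bar_ids, attr_ids, bar=2, attr=1)
match_logits = ScoreMidiMatchingHead()(torch.randn(8, 768))
```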
2. Related Works
3. The Proposed Method
3.1. MIDI-Score Pre-Training Dataset
- Stylistic harmonization. Recognizing the diversity of musical styles, we employed stratified sampling during dataset assembly so that the various styles of the source datasets remain proportionally represented. Each piece was additionally annotated with stylistic metadata, enabling targeted sampling and analysis by style (see the sampling sketch after this list).
- Complexity harmonization. To address differences in musical complexity, we developed a normalization algorithm that adjusts for note density, pitch range, and rhythmic variety, so that pieces of varying difficulty are equitably represented and can be processed effectively by our model (see the feature sketch after this list).
- Transformation to score images. To create a consistent representation for multimodal pre-training, we converted the MIDI data into score images using a uniform process: each MIDI composition was rendered as a single score image covering only its first 12 bars (pieces shorter than 12 bars were rendered in full). This kept the computation feasible while capturing the essence of each piece. We used the Midi Sheet Music tool (http://midisheetmusic.com/index.html, accessed on 31 July 2024) and applied the same conversion across all datasets (see the truncation sketch after this list).
- Dataset standardization. To manage computational complexity and ensure uniformity, we implemented a standardization protocol that includes cleansing and deduplication, modeled after the procedures used in MusicBERT [8] and applied to both subsets (see the deduplication sketch after this list). MSD-small, derived from Pop1K7, underwent the same rigorous preprocessing described for MusicBERT, yielding 1589 songs and 1589 MIDI-Score pairs; it serves as the base pre-training dataset. MSD-large, an amalgamation of Pop1K7, MAESTRO, GiantMIDI-Piano, and Lakh-MIDI, was processed in the same way, yielding 159,907 songs and 159,907 MIDI-Score pairs; it serves as the large-scale pre-training dataset.
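A minimal sketch of the stratified sampling step, assuming each piece carries a style label from its metadata; the grouping key and sampling rate are illustrative, not the exact procedure used.

```python
# Stratified sampling over style metadata (illustrative parameters).
import random
from collections import defaultdict

def stratified_sample(pieces, rate=0.5, seed=0):
    """pieces: list of (piece_id, style) tuples. Sampling each style
    stratum at the same rate preserves the source style proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for piece_id, style in pieces:
        strata[style].append(piece_id)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * rate))  # same rate per stratum
        sample.extend(rng.sample(members, k))
    return sample
```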
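The complexity features named above could be computed along these lines; this sketch uses pretty_midi, and the exact feature definitions and the z-score normalization are assumptions, not the published algorithm.

```python
# Complexity features: note density, pitch range, rhythmic variety.
import numpy as np
import pretty_midi

def complexity_features(path):
    midi = pretty_midi.PrettyMIDI(path)
    notes = [n for inst in midi.instruments for n in inst.notes]
    if not notes:
        return np.zeros(3)
    pitches = [n.pitch for n in notes]
    durations = [n.end - n.start for n in notes]
    return np.array([
        len(notes) / max(midi.get_end_time(), 1e-6),  # note density (notes/s)
        max(pitches) - min(pitches),                  # pitch range (semitones)
        len({round(d, 3) for d in durations}),        # rhythmic variety
    ])

def normalize(features):
    # Standardize each feature to zero mean / unit variance across the corpus.
    mu, sigma = features.mean(axis=0), features.std(axis=0) + 1e-8
    return (features - mu) / sigma
```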
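A sketch of the 12-bar truncation, using music21 as a stand-in for the Midi Sheet Music renderer; the function name and output format are hypothetical.

```python
# Truncate a piece to its first 12 bars before score rendering.
from music21 import converter

def first_twelve_bars(midi_path, out_path):
    score = converter.parse(midi_path)
    excerpt = score.measures(1, 12)  # shorter pieces keep all their measures
    excerpt.write('midi', fp=out_path)  # render this excerpt to an image downstream
    return out_path
```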
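A sketch of MusicBERT-style deduplication, keyed on a hash of the quantized note sequence; the quantization grid and key construction are illustrative assumptions.

```python
# Deduplicate songs whose quantized note sequences collide.
import hashlib

def dedup_key(notes, grid=0.125):
    """notes: iterable of (onset_seconds, pitch). Quantizing onsets makes
    near-identical transcriptions of the same song collide."""
    canonical = tuple(sorted((round(onset / grid), pitch) for onset, pitch in notes))
    return hashlib.sha1(repr(canonical).encode()).hexdigest()

def deduplicate(songs):
    """songs: iterable of (song_id, notes). Keeps the first copy of each key."""
    seen, unique = set(), []
    for song_id, notes in songs:
        key = dedup_key(notes)
        if key not in seen:
            seen.add(key)
            unique.append(song_id)
    return unique
```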
3.2. MIDI-Symbolized Representation
3.3. Score Image Representation
3.4. Representation Embedding and Integration Processing
3.5. MIDI-Score Pre-Training Strategies
4. Experiments
4.1. Experiments Setup
4.2. Note-Level Understanding Tasks
4.3. Song-Level Understanding Tasks
4.4. Phrase-Level Understanding Tasks
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Li, T.; Ogihara, M. Toward Intelligent Music Information Retrieval. IEEE Trans. Multimed. 2006, 8, 564–574. [Google Scholar] [CrossRef]
- Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Simon, I.; Hawthorne, C.; Shazeer, N.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck, D. Music Transformer: Generating Music with Long-Term Structure. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Qin, Y.; Xie, H.; Ding, S.; Tan, B.; Li, Y.; Zhao, B.; Ye, M. Bar Transformer: A Hierarchical Model for Learning Long-Term Structure and Generating Impressive Pop Music. Appl. Intell. 2023, 53, 10130–10148. [Google Scholar] [CrossRef]
- Oore, S.; Simon, I.; Dieleman, S.; Eck, D.; Simonyan, K. This Time with Feeling: Learning Expressive Musical Performance. Neural Comput. Appl. 2020, 32, 955–967. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Zeng, M.; Tan, X.; Wang, R.; Ju, Z.; Qin, T.; Liu, T.Y. MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 791–800. [Google Scholar] [CrossRef]
- Chou, Y.H.; Chen, I.C.; Chang, C.J.; Ching, J.; Yang, Y.H. MidiBERT-Piano: Large-Scale Pre-Training for Symbolic Music Understanding. 2021. Available online: https://www.researchgate.net/publication/353208453_MidiBERT-Piano_Large-scale_Pre-training_for_Symbolic_Music_Understanding (accessed on 31 July 2024).
- Fu, Y.; Tanimura, Y.; Nakada, H. Improve Symbolic Music Pre-Training Model Using MusicTransformer Structure. In Proceedings of the 2023 17th International Conference on Ubiquitous Information Management and Communication (IMCOM), Seoul, Republic of Korea, 3–5 January 2023; pp. 1–6. [Google Scholar] [CrossRef]
- Shen, Z.; Yang, L.; Yang, Z.; Lin, H. More Than Simply Masking: Exploring Pre-Training Strategies for Symbolic Music Understanding. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, Thessaloniki, Greece, 12–15 June 2023; pp. 540–544. [Google Scholar] [CrossRef]
- Jeong, D.; Kwon, T.; Kim, Y.; Nam, J. Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3060–3070. [Google Scholar]
- Tsai, T.J.; Ji, K. Composer Style Classification of Piano Sheet Music Images Using Language Model Pretraining. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Online, 12–16 October 2020; pp. 176–183. [Google Scholar]
- Yang, D.; Ji, K.; Tsai, T. A Deeper Look at Sheet Music Composer Classification Using Self-Supervised Pretraining. Appl. Sci. 2021, 11, 1387. [Google Scholar] [CrossRef]
- Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-Language Transformer without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
- Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13041–13049. [Google Scholar] [CrossRef]
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 9694–9705. [Google Scholar]
- Chuan, C.H.; Agres, K.; Herremans, D. From Context to Concept: Exploring Semantic Relationships in Music with Word2vec. Neural Comput. Appl. 2020, 32, 1023–1036. [Google Scholar] [CrossRef]
- Wu, T.; Zhang, J.; Duan, L.; Cai, Y. Music-Graph2Vec: An Efficient Method for Embedding Pitch Segment. In Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia ’23, Tainan, Taiwan, 6–8 December 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Liang, H.; Lei, W.; Chan, P.Y.; Yang, Z.; Sun, M.; Chua, T.S. PiRhDy: Learning Pitch-, Rhythm-, and Dynamics-Aware Embeddings for Symbolic Music. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 574–582. [Google Scholar] [CrossRef]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MPNet: Masked and Permuted Pre-Training for Language Understanding. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 16857–16867. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MASS: Masked Sequence to Sequence Pre-Training for Language Generation. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5926–5936. [Google Scholar]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-Training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Lin, Y.; Dai, Z.; Kong, Q. MusicScore: A Dataset for Music Score Modeling and Generation. arXiv 2024, arXiv:2406.11462. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Hsiao, W.Y.; Liu, J.Y.; Yeh, Y.C.; Yang, Y.H. Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 178–186. [Google Scholar] [CrossRef]
- Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.Z.A.; Dieleman, S.; Elsen, E.; Engel, J.; Eck, D. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Kong, Q.; Li, B.; Chen, J.; Wang, Y. GiantMIDI-Piano: A Large-Scale MIDI Dataset for Classical Piano Music. Trans. Int. Soc. Music Inf. Retr. 2022, 5, 87–98. [Google Scholar] [CrossRef]
- Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Thesis, Columbia University, New York, NY, USA, 2016. [Google Scholar] [CrossRef]
- Huang, Y.S.; Yang, Y.H. Pop Music Transformer: Beat-Based Modeling and Generation of Expressive Pop Piano Compositions. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1180–1188. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Wang, Z.; Chen, K.; Jiang, J.; Zhang, Y.; Xu, M.; Dai, S.; Gu, X.; Xia, G. POP909: A Pop-Song Dataset for Music Arrangement Generation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Virtual, 11–16 October 2020; pp. 38–45. [Google Scholar]
- Lin, Z.; Feng, M.; dos Santos, C.N.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A Structured Self-Attentive Sentence Embedding. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Hung, H.T.; Ching, J.; Doh, S.; Kim, N.; Nam, J.; Yang, Y.H. EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-Based Music Generation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Online, 7–12 November 2021. [Google Scholar]
- Chuan, C.H.; Herremans, D. Modeling Temporal Tonal Relations in Polyphonic Music Through Deep Networks with a Novel Image-Based Representation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2159–2166. [Google Scholar] [CrossRef]
Datasets | Usage | # Songs | # MIDI-Score Pairs |
---|---|---|---|
Pop1K7 | pre-training (base) | 1747 | × |
MAESTRO | pre-training (large) | 1184 | × |
GiantMIDI-Piano | pre-training (large) | 10,854 | × |
Lakh-MIDI | pre-training (large) | 176,581 | × |
MSD-small (Ours) | pre-training (base) | 1589 | 1589 |
MSD-large (Ours) | pre-training (large) | 159,907 | 159,907 |
(× = the source dataset provides MIDI only, with no matched score images.)
Models | Melody (Acc. %) | Velocity (Acc. %) | Composer (Acc. %) | Emotion (Acc. %) |
---|---|---|---|---|
RNN | 88.66 | 43.77 | 60.32 | 54.13 |
MidiBERT | 96.37 | 51.63 | 78.57 | 67.89 |
SIMbase | 98.22 | 54.42 | 84.68 | 78.31 |
SIMlarge | 99.06 | 56.75 | 88.19 | 82.62 |
Model | Melody Completion (MAP) | Melody Completion (HITS@1) | Melody Completion (HITS@5) | Accompaniment Suggestion (MAP) | Accompaniment Suggestion (HITS@1) | Accompaniment Suggestion (HITS@5) |
---|---|---|---|---|---|---|
Tonnetz | 0.683 | 0.545 | 0.865 | 0.423 | 0.101 | 0.407 |
PiRhDy | 0.971 | 0.950 | 0.995 | 0.567 | 0.184 | 0.540 |
MusicBERT | 0.985 | 0.975 | 0.997 | 0.946 | 0.333 | 0.857 |
SIMbase | 0.991 | 0.988 | 1.000 | 0.963 | 0.536 | 0.912 |
SIMlarge | 0.997 | 0.999 | 1.000 | 0.985 | 0.667 | 0.977 |
Models | Melody (Acc. %) | Velocity (Acc. %) | Composer (Acc. %) | Emotion (Acc. %) |
---|---|---|---|---|
SIMbase w/o SI | 88.89 | 44.87 | 61.84 | 58.26 |
SIMbase w/o MBAM | 96.33 | 50.62 | 79.41 | 72.89 |
SIMbase w/o MSM | 94.98 | 48.66 | 77.36 | 70.64 |
SIMbase+REMI | 92.92 | 51.43 | 77.12 | 71.53 |
SIMbase (8-Patch Division) | 96.83 | 51.62 | 82.44 | 77.01 |
SIMbase (16-Patch Division) | 97.19 | 52.33 | 83.38 | 77.77 |
SIMbase (ours) | 98.22 | 54.42 | 84.68 | 78.31 |
Vision Encoder | Melody Extraction Accuracy (%) | Computational Time per Batch (s) |
---|---|---|
ViT-S/16 | 96.83 | 0.23 |
ViT-B/16 | 98.22 | 0.48 |
ViT-L/16 | 98.28 | 1.06 |
ViT-H/16 | 98.33 | 1.98 |