Author: "Benetos A" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Benetos A"' showing total 4,357 results

1. ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time Optimization

Author: Steinmetz, Christian J., Singh, Shubhr, Comunità, Marco, Ibnyahya, Ilias, Yuan, Shanxin, Benetos, Emmanouil, and Reiss, Joshua D.
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects.
Comment: Accepted to ISMIR 2024. Code available at https://github.com/csteinmetz1/st-ito
Published: 2024

2. GraFPrint: A GNN-Based Approach for Audio Identification

Author: Bhattacharjee, Aditya, Singh, Shubhr, and Benetos, Emmanouil
Subjects: Computer Science - Sound, Computer Science - Information Retrieval, Electrical Engineering and Systems Science - Audio and Speech Processing, H.5.5, I.2.6
Abstract: This paper introduces GraFPrint, an audio identification framework that leverages the structural learning capabilities of Graph Neural Networks (GNNs) to create robust audio fingerprints. Our method constructs a k-nearest neighbor (k-NN) graph from time-frequency representations and applies max-relative graph convolutions to encode local and global information. The network is trained using a self-supervised contrastive approach, which enhances resilience to ambient distortions by optimizing feature representation. GraFPrint demonstrates superior performance on large-scale datasets at various levels of granularity, proving to be both lightweight and scalable, making it suitable for real-world applications with extensive reference databases.
Comment: Submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Published: 2024

3. OmniBench: Towards The Future of Universal Omni-Language Models

Author: Li, Yizhi, Zhang, Ge, Ma, Yinghao, Yuan, Ruibin, Zhu, Kang, Guo, Hangyu, Liang, Yiming, Liu, Jiaheng, Wang, Zekun, Yang, Jian, Wu, Siwei, Qu, Xingwei, Shi, Jinjie, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Zhang, Zhaoxiang, Liu, Zachary, Benetos, Emmanouil, Huang, Wenhao, and Lin, Chenghua
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and/or audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The code and live leaderboard can be found at https://m-a-p.ai/OmniBench.
Published: 2024

4. LC-Protonets: Multi-label Few-shot learning for world music audio tagging

Author: Papaioannou, Charilaos, Benetos, Emmanouil, and Potamianos, Alexandros
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.
Published: 2024

5. Acoustic identification of individual animals with hierarchical contrastive learning

Author: Nolasco, Ines, Moummad, Ilyass, Stowell, Dan, and Benetos, Emmanouil
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Acoustic identification of individual animals (AIID) is closely related to audio-based species classification but requires a finer level of detail to distinguish between individual animals within the same species. In this work, we frame AIID as a hierarchical multi-label classification task and propose the use of hierarchy-aware loss functions to learn robust representations of individual identities that maintain the hierarchical relationships among species and taxa. Our results demonstrate that hierarchical embeddings not only enhance identification accuracy at the individual level but also at higher taxonomic levels, effectively preserving the hierarchical structure in the learned representations. By comparing our approach with non-hierarchical models, we highlight the advantage of enforcing this structure in the embedding space. Additionally, we extend the evaluation to the classification of novel individual classes, demonstrating the potential of our method in open-set classification scenarios.
Comment: Under review; Submitted to ICASSP 2025
Published: 2024

6. Domain-Invariant Representation Learning of Bird Sounds

Author: Moummad, Ilyass, Serizel, Romain, Benetos, Emmanouil, and Farrugia, Nicolas
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Passive acoustic monitoring (PAM) is crucial for bioacoustic research, enabling non-invasive species tracking and biodiversity monitoring. Citizen science platforms like Xeno-Canto provide large annotated datasets from focal recordings, where the target species is intentionally recorded. However, PAM requires monitoring in passive soundscapes, creating a domain shift between focal and passive recordings, which challenges deep learning models trained on focal recordings. To address this, we leverage supervised contrastive learning to improve domain generalization in bird sound classification, enforcing domain invariance across same-class examples from different domains. We also propose ProtoCLR (Prototypical Contrastive Learning of Representations), which reduces the computational complexity of the SupCon loss by comparing examples to class prototypes instead of pairwise comparisons. Additionally, we present a new few-shot classification evaluation based on BIRB, a large-scale bird sound benchmark to evaluate bioacoustic pre-trained models.
Published: 2024

7. Foundation Models for Music: A Survey

Author: Ma, Yinghao, Øland, Anders, Ragni, Anton, Del Sette, Bleiz MacSen, Saitis, Charalampos, Donahue, Chris, Lin, Chenghua, Plachouras, Christos, Benetos, Emmanouil, Shatri, Elona, Morreale, Fabio, Zhang, Ge, Fazekas, György, Xia, Gus, Zhang, Huan, Manco, Ilaria, Huang, Jiawen, Guinot, Julien, Lin, Liwei, Marinelli, Luca, Lam, Max W. Y., Sharma, Megha, Kong, Qiuqiang, Dannenberg, Roger B., Yuan, Ruibin, Wu, Shangda, Wu, Shih-Lun, Dai, Shuqi, Lei, Shun, Kang, Shiyin, Dixon, Simon, Chen, Wenhu, Huang, Wenhao, Du, Xingjian, Qu, Xingwei, Tan, Xu, Li, Yizhi, Tian, Zeyue, Wu, Zhiyong, Wu, Zhizheng, Ma, Ziyang, and Wang, Ziyu
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
Published: 2024

8. MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Author: Weck, Benno, Manco, Ilaria, Benetos, Emmanouil, Quinton, Elio, Fazekas, George, and Bogdanov, Dmitry
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.
Comment: Accepted at ISMIR 2024. Data: https://doi.org/10.5281/zenodo.12709974 Code: https://github.com/mulab-mir/muchomusic Supplementary material: https://mulab-mir.github.io/muchomusic
Published: 2024

9. Can LLMs 'Reason' in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Author: Zhou, Ziya, Wu, Yuhang, Wu, Zhiyue, Zhang, Xinyue, Yuan, Ruibin, Ma, Yinghao, Wang, Lu, Benetos, Emmanouil, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Symbolic Music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses highlights distinctly their pros and cons. Our findings suggest achieving advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians.
Comment: Accepted by ISMIR 2024
Published: 2024

10. Stochastic branching models for the telomeres dynamics in a model including telomerase activity

Author: Benetos, Athanase, Fritsch, Coralie, Horton, Emma, Lenotre, Lionel, Toupance, Simon, and Villemonais, Denis
Subjects: Quantitative Biology - Cell Behavior, Mathematics - Probability, Quantitative Biology - Populations and Evolution
Abstract: Telomeres are repetitive sequences of nucleotides at the end of chromosomes, whose evolution over time is intrinsically related to biological ageing. In most cells, with each cell division, telomeres shorten due to the so-called end replication problem, which can lead to replicative senescence and a variety of age-related diseases. On the other hand, in certain cells, the presence of the enzyme telomerase can lead to the lengthening of telomeres, which may delay or prevent the onset of such diseases but can also increase the risk of cancer. In this article, we propose a stochastic representation of this biological model, which takes into account multiple chromosomes per cell, the effect of telomerase, different cell types and the dependence of the distribution of telomere length on the dynamics of the process. We study theoretical properties of this model, including its long-term behaviour. In addition, we investigate numerically the impact of the model parameters on biologically relevant quantities, such as the Hayflick limit and the Malthusian parameter of the population of cells.
Published: 2024

11. YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Author: Chang, Sungkyun, Benetos, Emmanouil, Kirchhoff, Holger, and Dixon, Simon
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at https://github.com/mimbres/YourMT3.
Comment: Accepted at IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2024, London
Published: 2024

12. Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model

Author: Huang, Jiawen and Benetos, Emmanouil
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains underexplored due to the scale of data and annotation quality. In this paper, we aim to create a multilingual ALT system with available datasets. Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario by expanding the target vocabulary set. We then evaluate the performance of the multilingual model in comparison to its monolingual counterparts. Additionally, we explore various conditioning methods to incorporate language information into the model. We apply analysis by language and combine it with the language classification performance. Our findings reveal that the multilingual model performs consistently better than the monolingual models trained on the language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.
Comment: Accepted at EUSIPCO 2024
Published: 2024

13. MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Author: Zhang, Ge, Qu, Scott, Liu, Jiaheng, Zhang, Chenchen, Lin, Chenghua, Yu, Chou Leuang, Pan, Danny, Cheng, Esther, Liu, Jie, Lin, Qunshu, Yuan, Raven, Zheng, Tuney, Pang, Wei, Du, Xinrun, Liang, Yiming, Ma, Yinghao, Li, Yizhi, Ma, Ziyang, Lin, Bill, Benetos, Emmanouil, Yang, Huan, Zhou, Junting, Ma, Kaijing, Liu, Minghao, Niu, Morry, Wang, Noah, Que, Quehry, Liu, Ruibo, Liu, Sine, Guo, Shawn, Gao, Soren, Zhou, Wangchunshu, Zhang, Xinyue, Zhou, Yizhi, Wang, Yubo, Bai, Yuelin, Zhang, Yuhan, Zhang, Yuxiang, Wang, Zenith, Yang, Zhenzhu, Zhao, Zijian, Zhang, Jiajun, Ouyang, Wanli, Huang, Wenhao, and Chen, Wenhu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model's weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.
Comment: https://map-neo.github.io/
Published: 2024

14. Explaining models relating objects and privacy

Author: Xompero, Alessio, Bontonou, Myriam, Arbona, Jean-Michel, Benetos, Emmanouil, and Cavallaro, Andrea
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurately predicting whether an image is private before sharing it online is difficult due to the vast variety of content and the subjective nature of privacy itself. In this paper, we evaluate privacy models that use objects extracted from an image to determine why the image is predicted as private. To explain the decision of these models, we use feature-attribution to identify and quantify which objects (and which of their features) are more relevant to privacy classification with respect to a reference input (i.e., no objects localised in an image) predicted as public. We show that the presence of the person category and its cardinality is the main factor for the privacy decision. Therefore, these models mostly fail to identify private images depicting documents with sensitive data, vehicle ownership, and internet activity, or public images with people (e.g., an outdoor concert or people walking in a public space next to a famous landmark). As baselines for future benchmarks, we also devise two strategies that are based on the person presence and cardinality and achieve comparable classification performance of the privacy models.
Comment: 7 pages, 3 figures, 1 table, supplementary material included as Appendix. Paper accepted at the 3rd XAI4CV Workshop at CVPR 2024. Code: https://github.com/graphnex/ig-privacy
Published: 2024

15. ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Author: Deng, Qixin, Yang, Qikai, Yuan, Ruibin, Huang, Yipeng, Wang, Yi, Liu, Xubo, Tian, Zeyue, Pan, Jiahao, Zhang, Ge, Lin, Hanfeng, Li, Yizhi, Ma, Yinghao, Fu, Jie, Lin, Chenghua, Benetos, Emmanouil, Wang, Wenwu, Xia, Guangyu, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and Chain-of-Thoughts. To further explore and enhance LLMs' potential in music composition by leveraging their reasoning ability and the large knowledge base in music history and theory, we propose ComposerX, an agent-based symbolic music generation framework. We find that applying a multi-agent approach significantly improves the music composition quality of GPT-4. The results demonstrate that ComposerX is capable of producing coherent polyphonic music compositions with captivating melodies, while adhering to user instructions.
Published: 2024

16. MuPT: A Generative Symbolic Music Pretrained Transformer

Author: Qu, Xingwei, Bai, Yuelin, Ma, Yinghao, Zhou, Ziya, Lo, Ka Man, Liu, Jiaheng, Yuan, Ruibin, Min, Lejun, Liu, Xueling, Zhang, Tianyu, Du, Xinrun, Guo, Shuyue, Liang, Yiming, Li, Yizhi, Wu, Shangda, Zhou, Junting, Zheng, Tianyu, Ma, Ziyang, Han, Fengze, Xue, Wei, Xia, Gus, Benetos, Emmanouil, Yue, Xiang, Lin, Chenghua, Tan, Xu, Huang, Stephen W., Fu, Jie, and Zhang, Ge
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
Published: 2024

17. Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Author: Liang, Jinhua, Nolasco, Ines, Ghani, Burooj, Phan, Huy, Benetos, Emmanouil, and Stowell, Dan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Detecting the presence of animal vocalisations in nature is essential to study animal populations and their behaviors. A recent development in the field is the introduction of the task known as few-shot bioacoustic sound event detection, which aims to train a versatile animal sound detector using only a small set of audio samples. Previous efforts in this area have utilized different architectures and data augmentation techniques to enhance model performance. However, these approaches have not fully bridged the domain gap between source and target distributions, limiting their applicability in real-world scenarios. In this work, we introduce a new dataset designed to augment the diversity and breadth of classes available for few-shot bioacoustic event detection, building on the foundations of our previous datasets. To establish a robust baseline system tailored for the DCASE 2024 Task 5 challenge, we delve into an array of acoustic features and adopt negative hard sampling as our primary domain adaptation strategy. This approach, chosen in alignment with the challenge's guidelines that necessitate the independent treatment of each audio file, sidesteps the use of transductive learning to ensure compliance while aiming to enhance the system's adaptability to domain shifts. Our experiments show that the proposed baseline system achieves a better performance compared with the vanilla prototypical network. The findings also confirm the effectiveness of each domain adaptation method by ablating different components within the networks. This highlights the potential to improve few-shot bioacoustic sound event detection by further reducing the impact of domain shift.
Published: 2024

18. Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models

Author: Postolache, Emilian, Mariani, Giorgio, Cosmo, Luca, Benetos, Emmanouil, and Rodolà, Emanuele
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks: generating a set of coherent sources, creating accompaniments, and performing source separation. Despite their versatility, they require estimating the joint distribution over the sources, necessitating pre-separated musical data, which is rarely available, and fixing the number and type of sources at training time. This paper generalizes MSDM to arbitrary time-domain diffusion models conditioned on text embeddings. These models do not require separated data as they are trained on mixtures, can parameterize an arbitrary number of sources, and allow for rich semantic control. We propose an inference procedure enabling the coherent generation of sources and accompaniments. Additionally, we adapt the Dirac separator of MSDM to perform source separation. We experiment with diffusion models trained on Slakh2100 and MTG-Jamendo, showcasing competitive generation and separation results in a relaxed data setting.
Comment: Accepted at ICASSP 2024
Published: 2024

19. WavCraft: Audio Editing and Generation with Large Language Models

Author: Liang, Jinhua, Zhang, Huan, Liu, Haohe, Cao, Yin, Kong, Qiuqiang, Liu, Xubo, Wang, Wenwu, Plumbley, Mark D., Phan, Huy, and Benetos, Emmanouil
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompose users' instructions into several tasks and tackle each task collaboratively with the particular module. Through task decomposition along with a set of task-specific models, WavCraft follows the input instruction to create or edit audio content with more details and rationales, facilitating user control. In addition, WavCraft is able to cooperate with users via dialogue interaction and even produce the audio content without explicit user commands. Experiments demonstrate that WavCraft yields a better performance than existing methods, especially when adjusting the local regions of audio clips. Moreover, WavCraft can follow complex instructions to edit and create audio content on top of input recordings, facilitating audio producers in a broader range of applications. Our implementation and demos are available at https://github.com/JinhuaLiang/WavCraft.
Published: 2024

20. Improvement of decision-making criteria for the care of elderly cancer patients by general practitioners (Lorraine, France)

Author: Niemier JY, Claudot F, Nguyen-Thi PL, Hubert JM, Rousselot H, Benetos A, and Perret-Guillaume C
Subjects: Elderly – Cancer – General practitioner – Treatment Decision making – Care improvement, Geriatrics, RC952-954.6
Abstract: Jean-Yves Niemier,1,2 Frédérique Claudot,3,4 Phi Linh Nguyen-Thi,4 Jean-Marie Hubert,5 Hubert Rousselot,2,6 Athanase Benetos,1 Christine Perret-Guillaume1,3 1Department of Geriatric Medicine, CHRU de Nancy, Nancy, France; 2UCOG Lorraine, Nancy, France; 3EA 4360 APEMAC, Faculté de Médecine, Université de Lorraine, Nancy, France; 4PARC, CHRU de Nancy, Nancy, France; 5Spincourt Multidisciplinary MSP, Spincourt, France; 6SISSPO Department, Institut de Cancérologie de Lorraine, Vandœuvre-lès-Nancy, France Objective: The objective of this study was to identify changes in the decision-making criteria of general practitioners (GPs) concerning the care of elderly cancer patients after 1 year of corrective measures for care practices in the Lorraine region, France. Materials and methods: In 2014, a postal mail questionnaire was sent to all GPs in the Lorraine region. This questionnaire was designed to identify GPs’ decision-making criteria. It was based on the results of a literature review and on existing guidelines. During 1 year, corrective measures were implemented to improve practices, especially training sessions for physicians and production of specific tools, including a guide to the accepted ideas in geriatric oncology. In 2015, the same questionnaire was resent to all GPs to compare the answers. Results: In 2014, 430 questionnaires were returned out of 2,048 sent, and in 2015, 378 questionnaires were returned out of 2,066 sent. Our results show for the first time that there exists a significant difference in the overall decision criteria between the two survey periods. This difference mainly concerns criteria related to the cancerous diseases. Physicians tend to consider the principal decision criteria to be less important after the training period. 
GPs express the importance of accessibility to specialists for additional advice in both 2014 and 2015; the distance between the patient’s home and an adapted care facility and the interval before care begins are viewed as similarly important. Conclusion: Training and information sessions for physicians remain the most important tool for improving care practices. Such training strategies are more effective when carried out at the geographical scale at which the cancer professionals practice, allowing them to exploit their local organizational structure. The analysis of our data makes it possible to further integrate the patient into the care path, which remains a public health issue in terms of cost and organization. Keywords: elderly, cancer, general practitioner, treatment decision-making, care improvement, older people, tumors, physician, management, ethics
Published: 2018

21. ChatMusician: Understanding and Generating Music Intrinsically with LLM

Author: Yuan, Ruibin, Lin, Hanfeng, Wang, Yi, Tian, Zeyue, Wu, Shangda, Shen, Tianhao, Zhang, Ge, Wu, Yuhang, Liu, Cong, Zhou, Ziya, Ma, Ziyang, Xue, Liumeng, Wang, Ziyu, Liu, Qin, Zheng, Tianyu, Li, Yizhi, Ma, Yinghao, Liang, Yiming, Chi, Xiaowei, Liu, Ruibo, Wang, Zili, Li, Pengfei, Wu, Jingcheng, Lin, Chenghua, Liu, Qifeng, Jiang, Tao, Huang, Wenhao, Chen, Wenhu, Benetos, Emmanouil, Fu, Jie, Xia, Gus, Dannenberg, Roger, Xue, Wei, Kang, Shiyin, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc, surpassing GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 on zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo in GitHub., Comment: GitHub: https://shanghaicannon.github.io/ChatMusician/
Published: 2024

22. A Data-Driven Analysis of Robust Automatic Piano Transcription

Author: Edwards, Drew, Dixon, Simon, Benetos, Emmanouil, Maezawa, Akira, and Kusaka, Yuta
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Algorithms for automatic piano transcription have improved dramatically in recent years due to new datasets and modeling techniques. Recent developments have focused primarily on adapting new neural network architectures, such as the Transformer and Perceiver, in order to yield more accurate systems. In this work, we study transcription systems from the perspective of their training data. By measuring their performance on out-of-distribution annotated piano data, we show how these models can severely overfit to acoustic properties of the training data. We create a new set of audio for the MAESTRO dataset, captured automatically in a professional studio recording environment via Yamaha Disklavier playback. Using various data augmentation techniques when training with the original and re-performed versions of the MAESTRO dataset, we achieve state-of-the-art note-onset accuracy of 88.4 F1-score on the MAPS dataset, without seeing any of its training data. We subsequently analyze these data augmentation techniques in a series of ablation studies to better understand their influence on the resulting models., Comment: Accepted for publication in IEEE Signal Processing Letters on 31 January 2024
Published: 2024

23. P7 TELOMERE LENGTH AND AORTIC VALVE CALCIFICATION

Author: Ilona Saraieva, Simon Toupance, Magnus Back, and Benetos Athanase
Subjects: Specialties of internal medicine, RC581-951, Diseases of the circulatory (Cardiovascular) system, RC666-701
Abstract: Background: Short telomere length (TL) is associated with atherosclerosis development. Aortic valve stenosis, an age-related disease characterized by narrowing of the aortic opening, is mainly caused by aortic valve calcification. Development of aortic calcifications shares many similarities with atherogenesis, we thus hypothesize that people with short TL may have higher risk to develop aortic valve stenosis. Methods: Aortic valves were obtained from 11 patients undergoing valve replacement surgery. Each valve cusp was macroscopically dissected into healthy, intermediate and calcified regions. DNA was extracted by phenol/chloroform method and TL measured by Southern blots of the terminal restriction fragments. Results: TL from healthy and intermediate valve regions were similar and then merged in a non-calcified group. In all subjects, TL of calcified regions were shorter than TL in non-calcified regions. The gap between TL of non-calcified and calcified regions was 0,53kb (p
Published: 2018

24. Geriatricians’ role in the management of aortic stenosis in frail older patients: a decade later

Author: Ungar, Andrea, Rivasi, Giulia, Testa, Giuseppe Dario, Boureau, Anne Sophie, Mattace-Raso, Francesco, Martínez-Sellés, Manuel, Bo, Mario, Petrovic, Mirko, Werner, Nikos, and Benetos, Athanase
Published: 2024

25. Aorta Segmentation in 3D CT Images by Combining Image Processing and Machine Learning Techniques

Author: Mavridis, Christos, Economopoulos, Theodore L., Benetos, Georgios, and Matsopoulos, George K.
Published: 2024

26. Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Author: Liang, Jinhua, Liu, Xubo, Wang, Wenwu, Plumbley, Mark D., Phan, Huy, and Benetos, Emmanouil
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate the data scarcity in the audio domain, a multi-task learning strategy is proposed by formulating diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the framework of audio language model by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and thus is capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses across two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate the APT's ability in extending frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question and answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
Published: 2023

27. A branching model for intergenerational telomere length dynamics

Author: Benetos, Athanasios, Coudray, Olivier, Gégout-Petit, Anne, Lenôtre, Lionel, Toupance, Simon, and Villemonais, Denis
Subjects: Quantitative Biology - Populations and Evolution, Mathematics - Probability
Abstract: We build and study an individual-based model of the evolution of telomere length in a population across multiple generations. This model is a continuous-time typed branching process, where the type of an individual includes its gamete mean telomere length and its age. We study its Malthusian behaviour and provide numerical simulations to understand the influence of biologically relevant parameters.
Published: 2023

28. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Author: Manco, Ilaria, Weck, Benno, Doh, SeungHeon, Won, Minz, Zhang, Yixiao, Bogdanov, Dmitry, Wu, Yusong, Chen, Ke, Tovstogan, Philip, Benetos, Emmanouil, Quinton, Elio, Fazekas, György, and Nam, Juhan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance., Comment: Accepted to NeurIPS 2023 Workshop on Machine Learning for Audio
Published: 2023

29. ATGNN: Audio Tagging Graph Neural Network

Author: Singh, Shubhr, Steinmetz, Christian J., Benetos, Emmanouil, Phan, Huy, and Stowell, Dan
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as a sequence of patches which is not flexible enough to capture irregular audio objects. In this work, we treat the spectrogram in a more flexible way by considering it as a graph structure and process it with a novel graph neural architecture called ATGNN. ATGNN not only combines the capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset, achieving comparable results to Transformer-based models with a significantly lower number of learnable parameters.
Published: 2023

30. MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning

Author: Li, Dichucheng, Ma, Yinghao, Wei, Weixing, Kong, Qiuqiang, Wu, Yulun, Che, Mingjin, Xia, Fan, Benetos, Emmanouil, and Li, Wei
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Instrument playing techniques (IPTs) constitute a pivotal component of musical expression. However, the development of automatic IPT detection methods suffers from limited labeled data and inherent class imbalance issues. In this paper, we propose to apply a self-supervised learning model pre-trained on large-scale unlabeled music data and finetune it on IPT detection tasks. This approach addresses data scarcity and class imbalance challenges. Recognizing the significance of pitch in capturing the nuances of IPTs and the importance of onset in locating IPT events, we investigate multi-task finetuning with pitch and onset detection as auxiliary tasks. Additionally, we apply a post-processing approach for event-level prediction, where an IPT activation initiates an event only if the onset output confirms an onset in that frame. Our method outperforms prior approaches in both frame-level and event-level metrics across multiple IPT benchmark datasets. Further experiments demonstrate the efficacy of multi-task finetuning on each IPT class., Comment: submitted to ICASSP 2024
Published: 2023

31. MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Author: Deng, Zihao, Ma, Yinghao, Liu, Yudong, Guo, Rongchen, Zhang, Ge, Chen, Wenhu, Huang, Wenhao, and Benetos, Emmanouil
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia, Computer Science - Sound
Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains not well-explored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps datasets, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.
Published: 2023

32. From West to East: Who can understand the music of the others better?

Author: Papaioannou, Charilaos, Benetos, Emmanouil, and Potamianos, Alexandros
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whether we can build similar music audio embedding models trained on data from different cultures or styles. To that end, we leverage transfer learning methods to derive insights about the similarities between the different music cultures to which the data belongs. We use two Western music datasets, two traditional/folk datasets coming from eastern Mediterranean cultures, and two datasets belonging to Indian art music. Three deep audio embedding models are trained and transferred across domains, including two CNN-based and a Transformer-based architecture, to perform auto-tagging for each target domain dataset. Experimental results show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture. The implementation and the trained models are both provided in a public repository.
Published: 2023

33. On the Effectiveness of Speech Self-supervised Learning for Music

Author: Ma, Yinghao, Yuan, Ruibin, Li, Yizhi, Zhang, Ge, Chen, Xingran, Yin, Hanzhi, Lin, Chenghua, Benetos, Emmanouil, Ragni, Anton, Gyenge, Norbert, Liu, Ruibo, Xia, Gus, Dannenberg, Roger, Guo, Yike, and Fu, Jie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) still remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaption of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT, and refer to them as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate the MIR task performances with 13 different MIR tasks. Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech. However, we identify the limitations of such existing speech-oriented designs, especially in modelling polyphonic information. Based on the experimental results, empirical suggestions are also given for designing future musical SSL strategies and paradigms.
Published: 2023

34. LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

Author: Zhuo, Le, Yuan, Ruibin, Pan, Jiahao, Ma, Yinghao, Li, Yizhi, Zhang, Ge, Liu, Si, Dannenberg, Roger, Fu, Jie, Lin, Chenghua, Benetos, Emmanouil, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task., Comment: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023
Published: 2023

35. MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Author: Yuan, Ruibin, Ma, Yinghao, Li, Yizhi, Zhang, Ge, Chen, Xingran, Yin, Hanzhi, Zhuo, Le, Liu, Yiqi, Huang, Jiawen, Tian, Zeyue, Deng, Binyue, Wang, Ningzhi, Lin, Chenghua, Benetos, Emmanouil, Ragni, Anton, Gyenge, Norbert, Dannenberg, Roger, Chen, Wenhu, Xia, Gus, Xue, Wei, Liu, Si, Wang, Shi, Liu, Ruibo, Guo, Yike, and Fu, Jie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research., Comment: camera-ready version for NeurIPS 2023
Published: 2023

36. Strategies for Identifying Patients for Deprescribing of Blood Pressure Medications in Routine Practice: An Evidence Review

Author: Sheppard, James P., Benetos, Athanase, Bogaerts, Jonathan, Gnjidic, Danijela, and McManus, Richard J.
Published: 2024

37. MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Author: Li, Yizhi, Yuan, Ruibin, Zhang, Ge, Ma, Yinghao, Chen, Xingran, Yin, Hanzhi, Xiao, Chenghao, Lin, Chenghua, Ragni, Anton, Benetos, Emmanouil, Gyenge, Norbert, Dannenberg, Roger, Liu, Ruibo, Chen, Wenhu, Xia, Gus, Shi, Yemin, Huang, Wenhao, Wang, Zili, Guo, Yike, and Fu, Jie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores., Comment: accepted by ICLR 2024
Published: 2023

38. Few-shot Class-incremental Audio Classification Using Dynamically Expanded Classifier with Self-attention Modified Prototypes

Author: Li, Yanxiong, Cao, Wenchang, Xie, Wei, Li, Jialong, and Benetos, Emmanouil
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most existing methods for audio classification assume that the vocabulary of audio classes to be classified is fixed. When novel (unseen) audio classes appear, audio classification systems need to be retrained with abundant labeled samples of all audio classes for recognizing base (initial) and novel audio classes. If novel audio classes continue to appear, the existing methods for audio classification will be inefficient and even infeasible. In this work, we propose a method for few-shot class-incremental audio classification, which can continually recognize novel audio classes without forgetting old ones. The framework of our method mainly consists of two parts: an embedding extractor and a classifier, and their constructions are decoupled. The embedding extractor is the backbone of a ResNet based network, which is frozen after construction by a training strategy using only samples of base audio classes. However, the classifier consisting of prototypes is expanded by a prototype adaptation network with few samples of novel audio classes in incremental sessions. Labeled support samples and unlabeled query samples are used to train the prototype adaptation network and update the classifier, since they are informative for audio classification. Three audio datasets, named NSynth-100, FSC-89 and LS-100 are built by choosing samples from audio corpora of NSynth, FSD-MIX-CLIP and LibriSpeech, respectively. Results show that our method exceeds baseline methods in average accuracy and performance dropping rate. In addition, it is competitive compared to baseline methods in computational complexity and memory requirement. The code for our method is given at https://github.com/vinceasvp/FCAC., Comment: 13 pages, 8 figures, 12 tables. Accepted for publication in IEEE TMM
Published: 2023

39. Adapting Language-Audio Models as Few-Shot Audio Learners

Author: Liang, Jinhua, Liu, Xubo, Liu, Haohe, Phan, Huy, Benetos, Emmanouil, Plumbley, Mark D., and Wang, Wenwu
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-free version of the Treff adapter by using CALM as a cosine similarity measure. Experiments showed that the proposed Treff adapter is comparable and even better than fully-supervised methods and adaptation methods in low-shot and data-abundant scenarios. While the Treff adapter shows that combining large-scale pretraining and rapid learning of domain-specific knowledge is non-trivial for obtaining generic representations for few-shot learning, it is still limited to audio classification tasks. In the future, we will explore how to use audio-language models in diverse audio domains.
Published: 2023

40. Visceral obesity is not an independent risk factor of mortality in subjects over 65 years

Author: Thomas F, Pannier B, Benetos A, and Vischer UM
Subjects: Diseases of the circulatory (Cardiovascular) system, RC666-701
Abstract: Frédérique Thomas,1 Bruno Pannier,1 Athanase Benetos,2 Ulrich M Vischer3,† 1Centre d'Investigations Préventives et Cliniques, Paris, France; 2Department of Geriatrics, Nancy University Hospital Center, University of Lorraine, Nancy, France; 3Geneva University Hospitals, Department of Rehabilitation and Geriatrics, Geneva, Switzerland †Ulrich M Vischer passed away on March 19, 2012 Abstract: The aim of the study was to determine the role of obesity evaluated by body mass index (BMI), waist circumference (WC), and their combined effect on all-cause mortality according to age and related risk factors. This study included 119,090 subjects (79,325 men and 39,765 women), aged from 17 years to 85 years, who had a general health checkup at the Centre d'Investigations Préventives et Cliniques, Paris, France. The mean follow-up was 5.6±2.4 years. The prevalence of obesity, defined by WC and BMI categories, was determined according to age groups (65 years). All-cause mortality according to obesity and age was determined using Cox regression analysis, adjusted for related risk factors and previous cardiovascular events. For the entire population, WC adjusted for BMI, an index of central obesity, was strongly associated with mortality, even after adjustment for hypertension, dyslipidemia, and diabetes. The prevalence of obesity increased with age, notably when defined by WC. Nonetheless, the association between WC adjusted for BMI and mortality was not observed in subjects >65 years old (hazard ratio [HR] =1.010, P=NS) but was found in subjects 65 years of age, suggesting a differential impact of visceral fat deposition according to age. Keywords: abdominal, aging, body mass index, hypertension, smoking
Published: 2013

41. Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition

Author: Liang, Jinhua, Phan, Huy, and Benetos, Emmanouil
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Everyday sound recognition aims to infer types of sound events in audio streams. While many works succeeded in training models with high performance in a fully-supervised manner, they are still restricted by the demand for large quantities of labelled data and by the range of predefined classes. To overcome these drawbacks, this work first curates a new database named FSD-FS for multi-label few-shot audio classification. It then explores how to incorporate audio taxonomy in few-shot learning. Specifically, this work proposes label-dependent prototypical networks (LaD-protonet) to exploit parent-children relationships between labels. It also applies taxonomy-aware label smoothing techniques to boost model performance. Experiments demonstrate that LaD-protonet outperforms original prototypical networks as well as other state-of-the-art methods. Moreover, its performance can be further boosted when combined with taxonomy-aware label smoothing. Comment: Submitted to ICASSP 2023
Published: 2022

42. MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning

Author: Li, Yizhi, Yuan, Ruibin, Zhang, Ge, Ma, Yinghao, Lin, Chenghua, Chen, Xingran, Ragni, Anton, Yin, Hanzhi, Hu, Zhijie, He, Haoyu, Benetos, Emmanouil, Gyenge, Norbert, Liu, Ruibo, and Fu, Jie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves comparable results to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with fewer than 2% of the latter's parameters. The model will be released on Hugging Face (please refer to: https://huggingface.co/m-a-p/music2vec-v1).
Published: 2022

43. MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response.

Author: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, and Emmanouil Benetos
Published: 2024

44. ChatMusician: Understanding and Generating Music Intrinsically with LLM.

Author: Ruibin Yuan, Hanfeng Lin, Yi Wang 0033, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Liumeng Xue, Ziyang Ma, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Chenghua Lin, Qifeng Liu, Tao Jiang, Wenhao Huang, Wenhu Chen, Jie Fu 0001, Emmanouil Benetos, Gus Xia, Roger B. Dannenberg, Wei Xue, Shiyin Kang, and Yike Guo
Published: 2024

45. Learning from Taxonomy: Multi-Label Few-Shot Classification for Everyday Sound Recognition.

Author: Jinhua Liang, Huy Phan, and Emmanouil Benetos
Published: 2024

46. Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models.

Author: Emilian Postolache, Giorgio Mariani, Luca Cosmo, Emmanouil Benetos, and Emanuele Rodolà
Published: 2024

47. Mertech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model with Multi-Task Finetuning.

Author: Dichucheng Li, Yinghao Ma, Weixing Wei, Qiuqiang Kong, Yulun Wu, Mingjin Che, Fan Xia, Emmanouil Benetos, and Wei Li 0012
Published: 2024

48. Design and Implementation of a 3D Printed Robotic Vision System Connected to an Ontology-Based Editor for Manuscript Transcription and Annotation

Author: Sigalas, John, Skarpetis, Michael G., Koumboulis, Fotis N., Benetos, Dionysios, Kouvakas, Nikolaos D., Markopoulos, Georgios, and Papadaki, Anna. Series editors: Chakravorty, Antorweep, Verma, Ajit Kumar, Bhattacharya, Pushpak, Pant, Millie, and Ghosh, Shubha. Editors: Farmanbar, Mina, and Tzamtzi, Maria
Published: 2024

49. Learning Music Representations with wav2vec 2.0

Author: Ragano, Alessandro, Benetos, Emmanouil, and Hines, Andrew
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Learning music representations that are general-purpose offers the flexibility to finetune on several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation model showed promising results in many downstream speech tasks, but has been less effective when adapted to music. In this paper, we evaluate whether pre-training wav2vec 2.0 directly on music data can be a better solution than finetuning the speech model. We illustrate that when pre-training on music data, the discrete latent representations are able to encode the semantic meaning of musical concepts such as pitch and instrument. Our results show that finetuning wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations. In addition, the results are superior to the pre-trained model on speech embeddings, demonstrating that wav2vec 2.0 pre-trained on music data can be a promising music representation model. Comment: Submitted to ICASSP 2023
Published: 2022

50. Contrastive Audio-Language Learning for Music

Author: Manco, Ilaria, Benetos, Emmanouil, Quinton, Elio, and Fazekas, György
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for Music Contrastive Audio-Language Learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box. Thanks to this property, MusCALL can be transferred to virtually any task that can be cast as text-based retrieval. Our experiments show that our method performs significantly better than the baselines at retrieving audio that matches a textual description and, conversely, text that matches an audio query. We also demonstrate that the multimodal alignment capability of our model can be successfully extended to the zero-shot transfer scenario for genre classification and auto-tagging on two public datasets. Comment: Accepted to ISMIR 2022
Published: 2022
