156 results for "Esling, Philippe"
Search Results
2. Unsupervised Composable Representations for Audio
- Author
-
Bindi, Giovanni and Esling, Philippe
- Subjects
Computer Science - Machine Learning ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Current generative models are able to generate high-quality artefacts but have been shown to struggle with compositional reasoning, which can be defined as the ability to generate complex structures from simpler elements. In this paper, we focus on the problem of compositional representation learning for music data, specifically targeting the fully-unsupervised setting. We propose a simple and extensible framework that leverages an explicit compositional inductive bias, defined by a flexible auto-encoding objective that can leverage any of the current state-of-the-art generative models. We demonstrate that our framework, used with diffusion models, naturally addresses the task of unsupervised audio source separation, showing that our model is able to perform high-quality separation. Our findings reveal that our proposal achieves comparable or superior performance with respect to other blind source separation methods and, furthermore, it even surpasses current state-of-the-art supervised baselines on signal-to-interference ratio metrics. Additionally, by learning an a-posteriori masking diffusion model in the space of composable representations, we achieve a system capable of seamlessly performing unsupervised source separation, unconditional generation, and variation generation. Finally, as our proposal works in the latent space of pre-trained neural audio codecs, it also provides a lower computational cost with respect to other neural baselines., Comment: ISMIR 2024
- Published
- 2024
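The compositional inductive bias described in this abstract can be pictured as an auto-encoding objective in which the latent code of a mixture is tied to the sum of its sources' codes. The sketch below only illustrates that general idea; the encoder, decoder and shapes are placeholders and do not reproduce the paper's diffusion-based framework over neural audio codec latents.

```python
import torch
import torch.nn as nn

# Toy encoder/decoder over fixed-size frames (placeholders, not the paper's models).
enc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 16))
dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))

def composable_ae_loss(sources):
    """sources: (batch, n_sources, 128) frames of the individual stems."""
    mix = sources.sum(dim=1)              # the mixture is the sum of the sources
    z_sources = enc(sources)              # encode each source separately
    z_mix = enc(mix)                      # encode the mixture
    recon = dec(z_sources).sum(dim=1)     # decode each source code, then re-mix
    # (1) decoded sources should reconstruct the mixture;
    # (2) the mixture's code should equal the sum of the source codes (compositional bias).
    return nn.functional.mse_loss(recon, mix) + \
           nn.functional.mse_loss(z_mix, z_sources.sum(dim=1))

loss = composable_ae_loss(torch.randn(8, 4, 128))
loss.backward()
```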
3. Combining audio control and style transfer using latent diffusion
- Author
-
Demerlé, Nils, Esling, Philippe, Doras, Guillaume, and Genova, David
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
Deep generative models are now able to synthesize high-quality audio signals, shifting the critical aspect in their development from audio quality to control capabilities. Although text-to-music generation is being widely adopted by the general public, explicit control and example-based style transfer are better suited to capturing the intents of artists and musicians. In this paper, we aim to unify explicit control and style transfer within a single model by separating local and global information to capture musical structure and timbre respectively. To do so, we leverage the capabilities of diffusion autoencoders to extract semantic features, in order to build two representation spaces. We enforce disentanglement between those spaces using an adversarial criterion and a two-stage training strategy. Our resulting model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example. We evaluate our model on one-shot timbre transfer and MIDI-to-audio tasks on instrumental recordings and show that we outperform existing baselines in terms of audio quality and target fidelity. Furthermore, we show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre., Comment: ISMIR 2024
- Published
- 2024
4. Continuous descriptor-based control for deep audio synthesis
- Author
-
Devis, Ninon, Demerlé, Nils, Nabi, Sarah, Genova, David, and Esling, Philippe
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Despite significant advances in deep models for music generation, the use of these techniques remains restricted to expert users. Before being democratized among musicians, generative models must first provide expressive control over the generation, as this conditions the integration of deep generative models in creative workflows. In this paper, we tackle this issue by introducing a deep generative audio model providing expressive and continuous descriptor-based control, while remaining lightweight enough to be embedded in a hardware synthesizer. We enforce the controllability of real-time generation by explicitly removing salient musical features in the latent space using an adversarial confusion criterion. User-specified features are then reintroduced as additional conditioning information, allowing for continuous control of the generation, akin to a synthesizer knob. We assess the performance of our method on a wide variety of sounds including instrumental, percussive and speech recordings while providing both timbre and attributes transfer, allowing new ways of generating sounds., Comment: ICASSP 2023
- Published
- 2023
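The adversarial confusion criterion mentioned above is commonly implemented with a gradient-reversal layer: an auxiliary predictor tries to recover the descriptor from the latent code, while the reversed gradient pushes the encoder to discard it before the descriptor is reinjected as conditioning. The snippet is a generic sketch of that mechanism, not the authors' code; every name is a placeholder.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def confusion_loss(z, attribute, predictor):
    """Train `predictor` to regress the attribute from z, while the reversed
    gradient pushes the encoder that produced z to hide that attribute."""
    pred = predictor(GradReverse.apply(z))
    return nn.functional.mse_loss(pred, attribute)

# Placeholder latent code, descriptor (e.g. a loudness value) and predictor.
z = torch.randn(8, 16, requires_grad=True)
attribute = torch.randn(8, 1)
predictor = nn.Linear(16, 1)
confusion_loss(z, attribute, predictor).backward()
```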
5. SingSong: Generating musical accompaniments from singing
- Author
-
Donahue, Chris, Caillon, Antoine, Roberts, Adam, Manilow, Ethan, Esling, Philippe, Agostinelli, Andrea, Verzetti, Mauro, Simon, Ian, Pietquin, Olivier, Zeghidour, Neil, and Engel, Jesse
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM (Borsos et al., 2022), a state-of-the-art approach for unconditional audio generation, to be suitable for conditional "audio-to-audio" generation tasks, and train it on the source-separated (vocal, instrumental) pairs. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline. Sound examples at https://g.co/magenta/singsong
- Published
- 2023
6. Creative divergent synthesis with generative models
- Author
-
Chemla--Romeu-Santos, Axel and Esling, Philippe
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
Machine learning approaches now achieve impressive generation capabilities in numerous domains such as image, audio or video. However, most training & evaluation frameworks revolve around the idea of strictly modelling the original data distribution rather than trying to extrapolate from it. This precludes the ability of such models to diverge from the original distribution and, hence, exhibit some creative traits. In this paper, we propose various perspectives on how this complicated goal could ever be achieved, and provide preliminary results on our novel training objective called Bounded Adversarial Divergence (BAD).
- Published
- 2022
7. Challenges in creative generative models for music: a divergence maximization perspective
- Author
-
Chemla--Romeu-Santos, Axel and Esling, Philippe
- Subjects
Statistics - Machine Learning ,Computer Science - Machine Learning ,Statistics - Applications - Abstract
The development of generative Machine Learning (ML) models in creative practices, enabled by the recent improvements in usability and availability of pre-trained models, is attracting growing interest among artists, practitioners and performers. Yet, the introduction of such techniques in artistic domains has also revealed multiple limitations that escape current evaluation methods used by scientists. Notably, most models are still unable to generate content that lies outside the domain defined by the training dataset. In this paper, we propose an alternative prospective framework, starting from a new general formulation of ML objectives, which we derive to delineate possible implications and solutions that already exist in the ML literature (notably for the audio and musical domain). We also discuss existing relations between generative models and computational creativity and how our framework could help address the lack of creativity in existing models., Comment: to be published in AI Music Creativity Conference proceedings (AIMC2022)
- Published
- 2022
8. Streamable Neural Audio Synthesis With Non-Causal Convolutions
- Author
-
Caillon, Antoine and Esling, Philippe
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
Deep learning models are mostly used in an offline inference fashion. However, this strongly limits the use of these models inside audio generation setups, as most creative workflows are based on real-time digital signal processing. Although approaches based on recurrent networks can be naturally adapted to this buffer-based computation, the use of convolutions still poses some serious challenges. To tackle this issue, the use of causal streaming convolutions has been proposed. However, this requires a more complex training procedure and can impact the resulting audio quality. In this paper, we introduce a new method for producing non-causal streaming models, which makes any convolutional model compatible with real-time buffer-based processing. As our method is based on a post-training reconfiguration of the model, we show that it is able to transform models trained without causal constraints into streaming models. We show how our method can be adapted to fit complex architectures with parallel branches. To evaluate our method, we apply it on the recent RAVE model, which provides high-quality real-time audio synthesis. We test our approach on multiple music and speech datasets and show that it is faster than overlap-add methods, while having no impact on the generation quality. Finally, we introduce two open-source implementations of our work as Max/MSP and PureData externals, and as a VST audio plugin, endowing traditional digital audio workstations with real-time neural audio synthesis on a laptop CPU.
- Published
- 2022
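Buffer-based streaming of convolutional models, as discussed above, typically relies on cached convolutions: the layer keeps the tail of the previous buffer as left context instead of padding each call. Below is a minimal causal-caching sketch under that assumption; it is not the paper's non-causal reconfiguration, which additionally introduces an internal delay to emulate right-hand context.

```python
import torch
import torch.nn as nn

class CachedConv1d(nn.Module):
    """Conv1d whose left padding is replaced by a cache of past samples,
    so that successive buffers produce the same output as offline inference."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.pad = (kernel_size - 1) * dilation
        self.register_buffer("cache", torch.zeros(1, in_ch, self.pad))

    def forward(self, x):                      # x: (1, in_ch, buffer_size)
        x = torch.cat([self.cache, x], dim=-1)
        self.cache = x[..., -self.pad:].detach()
        return self.conv(x)                    # output keeps the buffer length

layer = CachedConv1d(1, 8, kernel_size=3, dilation=2)
stream = [layer(torch.randn(1, 1, 512)) for _ in range(4)]  # buffer-by-buffer inference
```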
9. HEAR: Holistic Evaluation of Audio Representations
- Author
-
Turian, Joseph, Shier, Jordie, Khan, Humair Raj, Raj, Bhiksha, Schuller, Björn W., Steinmetz, Christian J., Malloy, Colin, Tzanetakis, George, Velarde, Gissel, McNally, Kirk, Henry, Max, Pinto, Nicolas, Noufi, Camille, Clough, Christian, Herremans, Dorien, Fonseca, Eduardo, Engel, Jesse, Salamon, Justin, Esling, Philippe, Manocha, Pranay, Watanabe, Shinji, Jin, Zeyu, and Bisk, Yonatan
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. HEAR was launched as a NeurIPS 2021 shared challenge. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear., Comment: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track
- Published
- 2022
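The "without fine-tuning" evaluation described above boils down to freezing each embedding model and training only a shallow predictor per downstream task. The sketch below illustrates that workflow with a stand-in embedding function and scikit-learn; it does not reproduce the actual HEAR API or its tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def embed(audio_batch):
    """Stand-in for a frozen audio embedding model (placeholder, not a HEAR submission)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(audio_batch), 128))

# Hypothetical downstream classification task: 200 clips, 10 classes.
clips = [np.zeros(16000) for _ in range(200)]
labels = np.random.randint(0, 10, size=200)

X = embed(clips)                               # embeddings computed once, model stays frozen
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # only the shallow probe is trained
print("probe accuracy:", probe.score(X_te, y_te))
```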
10. RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
- Author
-
Caillon, Antoine and Esling, Philippe
- Subjects
Computer Science - Machine Learning ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Deep generative models applied to audio have improved the state-of-the-art in many speech and music related tasks by a large margin. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows direct control over the trade-off between reconstruction fidelity and representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.
- Published
- 2021
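One way to read the "post-training analysis of the latent space" mentioned above is as fitting a PCA on encoded examples and keeping only the components needed to reach a target explained variance, trading reconstruction fidelity against compactness. The snippet is a generic sketch under that reading, using placeholder latents rather than the model's actual encoder.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: latent codes of the training set, shape (n_frames, latent_dim).
latents = np.random.randn(10000, 128)

pca = PCA().fit(latents)
cumulative = np.cumsum(pca.explained_variance_ratio_)
fidelity = 0.95                                # target share of explained variance
kept = int(np.searchsorted(cumulative, fidelity)) + 1
print(f"keep {kept}/{latents.shape[1]} dimensions for {fidelity:.0%} fidelity")

# Project to and from the compact space around the learned mean.
compact = (latents - pca.mean_) @ pca.components_[:kept].T
restored = compact @ pca.components_[:kept] + pca.mean_
```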
11. Signal-domain representation of symbolic music for learning embedding spaces
- Author
-
Prang, Mathieu and Esling, Philippe
- Subjects
Computer Science - Machine Learning ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
A key aspect of machine learning models lies in their ability to learn efficient intermediate features. However, the input representation plays a crucial role in this process, and polyphonic musical scores remain a particularly complex type of information. In this paper, we introduce a novel representation of symbolic music data, which transforms a polyphonic score into a continuous signal. We evaluate the ability to learn meaningful features from this representation from a musical point of view. Hence, we introduce an evaluation method relying on principled generation of synthetic data. Finally, to test our proposed representation, we conduct an extensive benchmark against recent polyphonic symbolic representations. We show that our signal-like representation leads to better reconstruction and disentangled features. This improvement is reflected in the metric properties of the learned space and in its generation ability with respect to music theory properties.
- Published
- 2021
12. Energy Consumption of Deep Generative Audio Models
- Author
-
Douwes, Constance, Esling, Philippe, and Briot, Jean-Pierre
- Subjects
Computer Science - Machine Learning ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
In most scientific domains, the deep learning community has largely focused on the quality of deep generative models, resulting in highly accurate and successful solutions. However, this race for quality comes at a tremendous computational cost, which incurs vast energy consumption and greenhouse gas emissions. At the heart of this problem are the measures that we use as a scientific community to evaluate our work. In this paper, we suggest relying on a multi-objective measure based on Pareto optimality, which takes into account both the quality of the model and its energy consumption. By applying our measure on the current state-of-the-art in generative audio models, we show that it can drastically change the significance of the results. We believe that this type of metric can be widely used by the community to evaluate their work, while putting computational cost, and ultimately energy consumption, in the spotlight of deep learning research., Comment: 5 pages, 2 figures, ICASSP 2022
- Published
- 2021
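A Pareto-optimality measure of the kind advocated above keeps only the models that no other model matches or beats on both quality and energy. A small sketch with made-up numbers (the actual models and measurements come from the paper, not from this snippet):

```python
# Each entry: (model name, quality score where higher is better, energy in kWh where lower is better).
models = [("A", 0.92, 120.0), ("B", 0.90, 15.0), ("C", 0.80, 12.0), ("D", 0.79, 40.0)]

def pareto_front(entries):
    """Keep entries not dominated by any other (at least as good on both axes, better on one)."""
    front = []
    for name, q, e in entries:
        dominated = any(q2 >= q and e2 <= e and (q2 > q or e2 < e)
                        for _, q2, e2 in entries)
        if not dominated:
            front.append((name, q, e))
    return front

print(pareto_front(models))   # D is dominated by B (better quality, less energy)
```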
13. Spectrogram Inpainting for Interactive Generation of Instrument Sounds
- Author
-
Bazin, Théis, Hadjeres, Gaëtan, Esling, Philippe, and Malt, Mikhail
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Human-Computer Interaction ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Modern approaches to sound synthesis using deep neural networks are hard to control, especially when fine-grained conditioning information is not available, hindering their adoption by musicians. In this paper, we cast the generation of individual instrumental notes as an inpainting-based task, introducing novel ways to iteratively shape sounds. To this end, we propose a two-step approach: first, we adapt the VQ-VAE-2 image generation architecture to spectrograms in order to convert real-valued spectrograms into compact discrete codemaps; we then implement token-masked Transformers for the inpainting-based generation of these codemaps. We apply the proposed architecture on the NSynth dataset on masked resampling tasks. Most crucially, we open-source an interactive web interface to transform sounds by inpainting, for artists and practitioners alike, opening up new creative uses., Comment: 8 pages + references + appendices. 4 figures. Published as a conference paper at the 2020 Joint Conference on AI Music Creativity, October 19-23, 2020, organized and hosted virtually by the Royal Institute of Technology (KTH), Stockholm, Sweden
- Published
- 2021
- Full Text
- View/download PDF
14. Creativity in the era of artificial intelligence
- Author
-
Esling, Philippe and Devis, Ninon
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence ,Computer Science - Human-Computer Interaction ,Computer Science - Machine Learning - Abstract
Creativity is a deeply debated topic, as this concept is arguably quintessential to our humanity. Across different epochs, it has been infused with an extensive variety of meanings relevant to each era. Alongside these, the evolution of technology has provided a plurality of novel tools for creative purposes. Recently, the advent of Artificial Intelligence (AI), through deep learning approaches, has led to proficient successes across various applications. The use of such technologies for creativity appears as a natural continuation of the artistic trends of this century. However, the aura of a technological artefact labeled as intelligent has unleashed passionate and somewhat unhinged debates on its implications for creative endeavors. In this paper, we aim to provide a new perspective on the question of creativity in the era of AI, by blurring the frontier between social and computational sciences. To do so, we rely on reflections from social science studies of creativity to see how current AI would be considered through this lens. As creativity is a highly context-prone concept, we underline the limits and deficiencies of current AI, which require moving towards artificial creativity. We argue that the objective of purely mimicking human creative traits towards a self-contained ex-nihilo generative machine would be highly counterproductive, putting us at risk of not harnessing the almost unlimited possibilities offered by the sheer computational power of artificial agents., Comment: Keynote paper - JIM Conference 2020 - 12 pages
- Published
- 2020
15. Neural Granular Sound Synthesis
- Author
-
Bitton, Adrien, Esling, Philippe, and Harada, Tatsuya
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows. In order to control the synthesis, all grains in a given corpus are analyzed through a set of acoustic descriptors. This provides a representation reflecting some form of local similarities across the grains. However, the quality of this grain space is bound by that of the descriptors. Its traversal is not continuously invertible to signal and does not render any structured temporality. We demonstrate that generative neural networks can implement granular synthesis while alleviating most of its shortcomings. We efficiently replace its audio descriptor basis by a probabilistic latent space learned with a Variational Auto-Encoder. In this setting the learned grain space is invertible, meaning that we can continuously synthesize sound when traversing its dimensions. It also implies that original grains are not stored for synthesis. Another major advantage of our approach is to learn structured paths inside this latent space by training a higher-level temporal embedding over arranged grain sequences. The model can be applied to many types of libraries, including pitched notes or unpitched drums and environmental noises. We report experiments on the common granular synthesis processes as well as novel ones such as conditional sampling and morphing., Comment: presented for ICMC 2021 (2020 postponed)
- Published
- 2020
16. Timbre latent space: exploration and creative aspects
- Author
-
Caillon, Antoine, Bitton, Adrien, Gatinet, Brice, and Esling, Philippe
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Recent studies show the ability of unsupervised models to learn invertible audio representations using Auto-Encoders. They enable high-quality sound synthesis but limited control, since the latent spaces do not disentangle timbre properties. The emergence of disentangled representations was studied in Variational Auto-Encoders (VAEs), and has been applied to audio. Using an additional perceptual regularization can align such latent representations with previously established multi-dimensional timbre spaces, while allowing continuous inference and synthesis. Alternatively, some specific sound attributes can be learned as control variables while unsupervised dimensions account for the remaining features. New possibilities for timbre manipulation are enabled by generative neural networks, although the exploration and creative use of their representations remain largely unexplored. The following experiments were conducted in cooperation with two composers and propose new creative directions for exploring latent sound synthesis of musical timbres, using specifically designed interfaces (Max/MSP, Pure Data) or mappings for descriptor-based synthesis.
- Published
- 2020
17. Ultra-light deep MIR by trimming lottery tickets
- Author
-
Esling, Philippe, Bazin, Theis, Bitton, Adrien, Carsault, Tristan, and Devis, Ninon
- Subjects
Computer Science - Machine Learning ,Computer Science - Information Retrieval ,Computer Science - Multimedia ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
Current state-of-the-art results in Music Information Retrieval are largely dominated by deep learning approaches. These provide unprecedented accuracy across all tasks. However, the consistently overlooked downside of these models is their stunningly massive complexity, which seems concomitantly crucial to their success. In this paper, we address this issue by proposing a model pruning method based on the lottery ticket hypothesis. We modify the original approach to allow for explicitly removing parameters, through structured trimming of entire units, instead of simply masking individual weights. This leads to models which are effectively lighter in terms of size, memory and number of operations. We show that our proposal can remove up to 90% of the model parameters without loss of accuracy, leading to ultra-light deep MIR models. We confirm the surprising result that, at smaller compression ratios (removing up to 85% of a network), lighter models consistently outperform their heavier counterparts. We exhibit these results on a large array of MIR tasks including audio classification, pitch recognition, chord extraction, drum transcription and onset estimation. The resulting ultra-light deep learning models for MIR can run on CPU, and can even fit on embedded devices with minimal degradation of accuracy., Comment: 8 pages, 2 figures. 21st International Society for Music Information Retrieval Conference 11-15 October 2020, Montreal, Canada
- Published
- 2020
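Structured trimming, as opposed to weight masking, removes whole units so the saved model actually shrinks in size and inference cost. The sketch below ranks the output units of a linear layer by a simple L1 criterion and rebuilds a smaller layer; it only illustrates the structured idea and is not the paper's selection criterion or training loop.

```python
import torch
import torch.nn as nn

def trim_linear(layer, keep_ratio=0.5):
    """Return a smaller nn.Linear keeping the output units with the largest L1 norm."""
    scores = layer.weight.abs().sum(dim=1)                 # one score per output unit
    n_keep = max(1, int(keep_ratio * layer.out_features))
    kept = torch.topk(scores, n_keep).indices.sort().values
    smaller = nn.Linear(layer.in_features, n_keep)
    with torch.no_grad():
        smaller.weight.copy_(layer.weight[kept])
        smaller.bias.copy_(layer.bias[kept])
    return smaller, kept          # `kept` lets the following layer drop its matching inputs

layer = nn.Linear(256, 128)
small, kept_units = trim_linear(layer, keep_ratio=0.1)     # removes 90% of the units
print(small)                      # Linear(in_features=256, out_features=12, bias=True)
```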
18. Diet deep generative audio models with structured lottery
- Author
-
Esling, Philippe, Devis, Ninon, Bitton, Adrien, Caillon, Antoine, Chemla--Romeu-Santos, Axel, and Douwes, Constance
- Subjects
Computer Science - Machine Learning ,Computer Science - Multimedia ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
Deep learning models have provided extremely successful solutions in most audio application fields. However, the high accuracy of these models comes at the expense of a tremendous computation cost. This aspect is almost always overlooked in evaluating the quality of proposed models. However, models should not be evaluated without taking into account their complexity. This aspect is especially critical in audio applications, which heavily rely on specialized embedded hardware with real-time constraints. In this paper, we build on recent observations that deep models are highly overparameterized, by studying the lottery ticket hypothesis on deep generative audio models. This hypothesis states that extremely efficient small sub-networks exist in deep models and would provide higher accuracy than larger models if trained in isolation. However, lottery tickets are found by relying on unstructured masking, which means that resulting models do not provide any gain in either disk size or inference time. Instead, we develop here a method aimed at performing structured trimming. We show that this requires relying on global selection and introduce a specific criterion based on mutual information. First, we confirm the surprising result that smaller models provide higher accuracy than their large counterparts. We further show that we can remove up to 95% of the model weights without significant degradation in accuracy. Hence, we can obtain very light models for generative audio across popular methods such as Wavenet, SING or DDSP, that are up to 100 times smaller with commensurate accuracy. We study the theoretical bounds for embedding these models on Raspberry Pi and Arduino, and show that we can obtain generative models on CPU with equivalent quality as large GPU models. Finally, we discuss the possibility of implementing deep generative audio models on embedded platforms., Comment: 8 pages, 5 figures. Proceedings of the 23rd International Conference on Digital Audio Effects (DAFx-20), Vienna, Austria, September 8-12, 2020
- Published
- 2020
19. Vector-Quantized Timbre Representation
- Author
-
Bitton, Adrien, Esling, Philippe, and Harada, Tatsuya
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning - Abstract
Timbre is a set of perceptual attributes that identifies different types of sound sources. Although its definition is usually elusive, it can be seen from a signal processing viewpoint as all the spectral features that are perceived independently from pitch and loudness. Some works have studied high-level timbre synthesis by analyzing the feature relationships of different instruments, but acoustic properties remain entangled and generation bound to individual sounds. This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features. We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution. Timbre transfer can be performed by encoding any variable-length input signals into the quantized latent features that are decoded according to the learned timbre. We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments as an intuitive modality to drive sound synthesis. Furthermore, we can map the discrete latent space to acoustic descriptors and directly perform descriptor-based synthesis.
- Published
- 2020
20. Cross-modal variational inference for bijective signal-symbol translation
- Author
-
Chemla--Romeu-Santos, Axel, Ntalampiras, Stavros, Esling, Philippe, Haus, Goffredo, and Assayag, Gérard
- Subjects
Statistics - Machine Learning ,Computer Science - Machine Learning - Abstract
Extraction of symbolic information from signals is an active field of research enabling numerous applications, especially in the Music Information Retrieval domain. This complex task, which is also related to other topics such as pitch extraction or instrument recognition, is a demanding subject that has given birth to numerous approaches, mostly based on advanced signal processing algorithms. However, these techniques are often non-generic, allowing the extraction of definite physical properties of the signal (pitch, octave), but not allowing arbitrary vocabularies or more general annotations. On top of that, these techniques are one-sided, meaning that they can extract symbolic data from an audio signal, but cannot perform the reverse process and make symbol-to-signal generation. In this paper, we propose a bijective approach for signal/symbol translation by turning this problem into a density estimation task over signal and symbolic domains, considered both as related random variables. We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match with an additive constraint, allowing both models to learn and generate separately while allowing signal-to-symbol and symbol-to-signal inference. In this article, we test our models on pitch, octave and dynamics symbols, which constitute a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather light during training and generation while allowing several interesting creative uses that we outline at the end of the article., Comment: Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-2019)
- Published
- 2020
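The additive constraint forcing the two inner representations to match can be written as an extra distance term added to the two ELBOs. Below is a schematic loss in that spirit, with tiny placeholder Gaussian VAEs standing in for the signal and symbol models; it is not the authors' exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal Gaussian VAE used as a stand-in for the signal and symbol models."""
    def __init__(self, dim, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)
        self.dec = nn.Linear(latent, dim)
    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

def elbo(recon, target, mu, logvar):
    """Reconstruction term plus KL divergence to a standard Gaussian prior."""
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return F.mse_loss(recon, target) + kl

def joint_loss(signal_vae, symbol_vae, signal, symbol, lam=1.0):
    recon_x, mu_x, logvar_x = signal_vae(signal)
    recon_y, mu_y, logvar_y = symbol_vae(symbol)
    match = F.mse_loss(mu_x, mu_y)            # additive constraint tying the two latent spaces
    return elbo(recon_x, signal, mu_x, logvar_x) \
         + elbo(recon_y, symbol, mu_y, logvar_y) + lam * match

loss = joint_loss(TinyVAE(128), TinyVAE(32), torch.randn(4, 128), torch.randn(4, 32))
loss.backward()
```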
21. Using musical relationships between chord labels in automatic chord extraction tasks
- Author
-
Carsault, Tristan, Nika, Jérôme, and Esling, Philippe
- Subjects
Computer Science - Sound ,Computer Science - Information Retrieval ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Recent research on Automatic Chord Extraction (ACE) has focused on the improvement of models based on machine learning. However, most models still fail to take into account the prior knowledge underlying the labeling alphabets (chord labels). Furthermore, recent works have shown that ACE performance is converging towards a glass ceiling. Therefore, this prompts the need to focus on other aspects of the task, such as the introduction of musical knowledge in the representation, the improvement of the models towards more complex chord alphabets and the development of more adapted evaluation methods. In this paper, we propose to exploit specific properties and relationships between chord labels in order to improve the learning of statistical ACE models. Hence, we analyze the interdependence of the representations of chords and their associated distances, the precision of the chord alphabets, and the impact of the reduction of the alphabet before or after training of the model. Furthermore, we propose new training losses based on musical theory. We show that these improve the results of ACE systems based on Convolutional Neural Networks. By performing an in-depth analysis of our results, we uncover a set of related insights on ACE tasks based on statistical models, and also formalize the musical meaning of some classification errors., Comment: Accepted for publication in ISMIR, 2018
- Published
- 2019
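One way to picture the "training losses based on musical theory" mentioned above is to penalize each wrong chord prediction by a musical distance to the correct label, so that confusing closely related chords costs less than confusing distant ones. The toy distance matrix below is invented purely for illustration; the paper derives its distances from actual chord relationships.

```python
import torch
import torch.nn.functional as F

def distance_weighted_loss(logits, target, chord_distance):
    """Expected-distance loss: each class probability is weighted by the musical
    distance from that chord to the correct one (zeros on the diagonal)."""
    probs = F.softmax(logits, dim=-1)
    penalties = chord_distance[target]          # (batch, n_chords) distances to the true chord
    return (probs * penalties).sum(dim=-1).mean()

# Toy alphabet of 4 chords with a hand-made symmetric distance matrix (illustrative only).
D = torch.tensor([[0., 1., 2., 3.],
                  [1., 0., 1., 2.],
                  [2., 1., 0., 1.],
                  [3., 2., 1., 0.]])
logits = torch.randn(8, 4, requires_grad=True)
target = torch.randint(0, 4, (8,))
distance_weighted_loss(logits, target, D).backward()
```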
22. Multi-Step Chord Sequence Prediction Based on Aggregated Multi-Scale Encoder-Decoder Network
- Author
-
Carsault, Tristan, McLeod, Andrew, Esling, Philippe, Nika, Jérôme, Nakamura, Eita, and Yoshii, Kazuyoshi
- Subjects
Computer Science - Machine Learning ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
This paper studies the prediction of chord progressions for jazz music by relying on machine learning models. The motivation of our study comes from the recent success of neural networks for performing automatic music composition. Although high accuracies are obtained in single-step prediction scenarios, most models fail to generate accurate multi-step chord predictions. In this paper, we postulate that this comes from the multi-scale structure of musical information and propose new architectures based on an iterative temporal aggregation of input labels. Specifically, the input and ground truth labels are merged into increasingly large temporal bags, on which we train a family of encoder-decoder networks for each temporal scale. In a second step, we use these pre-trained encoder bottleneck features at each scale in order to train a final encoder-decoder network. Furthermore, we rely on different reductions of the initial chord alphabet into three adapted chord alphabets. We perform evaluations against several state-of-the-art models and show that our multi-scale architecture outperforms existing methods in terms of accuracy and perplexity, while requiring relatively few parameters. We analyze musical properties of the results, showing the influence of downbeat position within the analysis window on accuracy, and evaluate errors using a musically-informed distance metric., Comment: Accepted for publication in MLSP, 2019
- Published
- 2019
23. Neural Drum Machine : An Interactive System for Real-time Synthesis of Drum Sounds
- Author
-
Aouameur, Cyran, Esling, Philippe, and Hadjeres, Gaëtan
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing ,68T99 - Abstract
In this work, we introduce a system for real-time generation of drum sounds. This system is composed of two parts: a generative model for drum sounds together with a Max4Live plugin providing intuitive controls on the generative process. The generative model consists of a Conditional Wasserstein autoencoder (CWAE), which learns to generate Mel-scaled magnitude spectrograms of short percussion samples, coupled with a Multi-Head Convolutional Neural Network (MCNN) which estimates the corresponding audio signal from the magnitude spectrogram. The design of this model makes it lightweight, so that it allows one to perform real-time generation of novel drum sounds on an average CPU, removing the need for the users to possess dedicated hardware in order to use this system. We then present our Max4Live interface designed to interact with this generative model. With this setup, the system can be easily integrated into a studio-production environment and enhance the creative process. Finally, we discuss the advantages of our system and how the interaction of music producers with such tools could change the way drum tracks are composed., Comment: 8 pages, accepted at the International Conference on Computational Creativity 2019
- Published
- 2019
24. Universal audio synthesizer control with normalizing flows
- Author
-
Esling, Philippe, Masuda, Naotake, Bardet, Adrien, Despres, Romeo, and Chemla--Romeu-Santos, Axel
- Subjects
Computer Science - Machine Learning ,Computer Science - Human-Computer Interaction ,Computer Science - Multimedia ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Statistics - Machine Learning - Abstract
The ubiquity of sound synthesizers has reshaped music production and even entirely defined new music genres. However, the increasing complexity and number of parameters in modern synthesizers make them harder to master. Hence, the development of methods that allow users to easily create and explore with synthesizers is a crucial need. Here, we introduce a novel formulation of audio synthesizer control. We formalize it as finding an organized latent audio space that represents the capabilities of a synthesizer, while constructing an invertible mapping to the space of its parameters. By using this formulation, we show that we can address simultaneously automatic parameter inference, macro-control learning and audio-based preset exploration within a single model. To solve this new formulation, we rely on Variational Auto-Encoders (VAE) and Normalizing Flows (NF) to organize and map the respective auditory and parameter spaces. We introduce disentangling flows, which perform the invertible mapping between separate latent spaces, while steering the organization of some latent dimensions to match target variation factors by splitting the objective as partial density evaluation. We evaluate our proposal against a large set of baseline models and show its superiority in both parameter inference and audio reconstruction. We also show that the model disentangles the major factors of audio variations as latent dimensions, which can be directly used as macro-parameters. We also show that our model is able to learn semantic controls of a synthesizer by smoothly mapping to its parameters. Finally, we discuss the use of our model in creative applications and its real-time implementation in Ableton Live., Comment: DAFx 2019
- Published
- 2019
25. Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders
- Author
-
Bitton, Adrien, Esling, Philippe, Caillon, Antoine, and Fouilleul, Martin
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Generative models have thrived in computer vision, enabling unprecedented image processes. Yet the results in audio remain less advanced. Our project targets real-time sound synthesis from a reduced set of high-level parameters, including semantic controls that can be adapted to different sound libraries and specific tags. These generative variables should allow expressive modulations of target musical qualities and continuously mix into new styles. To this end, we train AEs on an orchestral database of individual note samples, along with their intrinsic attributes: note class, timbre domain and extended playing techniques. We condition the decoder for control over the rendered note attributes and use latent adversarial training for learning expressive style parameters that can ultimately be mixed. We evaluate both generative performances and the latent representation. Our ablation study demonstrates the effectiveness of the musical conditioning mechanisms. The proposed model generates notes as magnitude spectrograms from any probabilistic latent code samples, with expressive control of orchestral timbres and playing styles. Its training data subsets can directly be visualized in the 3D latent representation. Waveform rendering can be done offline with GLA. In order to allow real-time interactions, we fine-tune the decoder with a pretrained MCNN and embed the full waveform generation pipeline in a plugin. Moreover, the encoder can be used to process new input samples; after manipulating their latent attribute representation, the decoder can generate sample variations, much as an audio effect would. Our solution remains rather fast to train and can directly be applied to other sound domains, including a user's libraries with custom sound tags that could be mapped to specific generative controls. As a result, it fosters creativity and intuitive audio style experimentation., Comment: this article has been accepted for presentation at the 22nd International Conference on Digital Audio Effects (DAFx 2019); we provide additional content in this companion repository https://github.com/acids-ircam/Expressive_WAE_FADER
- Published
- 2019
26. A database linking piano and orchestral MIDI scores with application to automatic projective orchestration
- Author
-
Crestel, Léopold, Esling, Philippe, Heng, Lena, and McAdams, Stephen
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
This article introduces the Projective Orchestral Database (POD), a collection of MIDI scores composed of pairs linking piano scores to their corresponding orchestrations. To the best of our knowledge, this is the first database of its kind; it can be used for piano or orchestral prediction but, more importantly, for learning the correlations between piano and orchestral scores. Hence, we also introduce the projective orchestration task, which consists in learning how to perform the automatic orchestration of a piano score. We show how this task can be addressed using learning methods and also provide methodological guidelines for the proper use of this database.
- Published
- 2018
27. Modulated Variational auto-Encoders for many-to-many musical timbre transfer
- Author
-
Bitton, Adrien, Esling, Philippe, and Chemla-Romeu-Santos, Axel
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Generative models have been successfully applied to image style transfer and domain translation. However, there is still a wide gap in the quality of results when learning such tasks on musical audio. Furthermore, most translation models only enable one-to-one or one-to-many transfer by relying on separate encoders or decoders and complex, computationally-heavy models. In this paper, we introduce the Modulated Variational auto-Encoders (MoVE) to perform musical timbre transfer. We define timbre transfer as applying parts of the auditory properties of a musical instrument onto another. First, we show that we can achieve this task by conditioning existing domain translation techniques with Feature-wise Linear Modulation (FiLM). Then, we alleviate the need for additional adversarial networks by replacing the usual translation criterion with a Maximum Mean Discrepancy (MMD) objective. This allows a faster and more stable training along with a controllable latent space encoder. By further conditioning our system on several different instruments, we can generalize to many-to-many transfer within a single variational architecture able to perform multi-domain transfers. Our models map inputs to 3-dimensional representations, successfully translating timbre from one instrument to another and supporting sound synthesis from a reduced set of control parameters. We evaluate our method in reconstruction and generation tasks while analyzing the auditory descriptor distributions across transferred domains. We show that this architecture allows for generative controls in multi-domain transfer, while remaining light, fast to train and effective on small datasets.
- Published
- 2018
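Replacing the adversarial translation criterion with Maximum Mean Discrepancy, as described above, amounts to comparing batches of latent codes with a kernel two-sample statistic. A standard RBF-kernel MMD estimator is sketched below; the bandwidth and the exact place where the loss is applied are assumptions, not details from the paper.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two batches of vectors with an RBF kernel."""
    def kernel(a, b):
        d2 = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)   # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# E.g. match the latent codes of two instrument domains instead of using a discriminator.
z_violin, z_flute = torch.randn(64, 16), torch.randn(64, 16)
print(rbf_mmd(z_violin, z_flute))
```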
28. Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics
- Author
-
Esling, Philippe, Chemla--Romeu-Santos, Axel, and Bitton, Adrien
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Timbre spaces have been used in music perception to study the perceptual relationships between instruments based on dissimilarity ratings. However, these spaces do not generalize to novel examples and do not provide an invertible mapping, preventing audio synthesis. In parallel, generative models have aimed to provide methods for synthesizing novel timbres. However, these systems do not provide an understanding of their inner workings and are usually not related to any perceptually relevant information. Here, we show that Variational Auto-Encoders (VAE) can alleviate all of these limitations by constructing generative timbre spaces. To do so, we adapt VAEs to learn an audio latent space, while using perceptual ratings from timbre studies to regularize the organization of this space. The resulting space allows us to analyze novel instruments, while being able to synthesize audio from any point of this space. We introduce a specific regularization allowing any given similarity distances to be enforced onto these spaces. We show that the resulting space provides distance relationships close to those of timbre spaces. We evaluate several spectral transforms and show that the Non-Stationary Gabor Transform (NSGT) provides the highest correlation to timbre spaces and the best quality of synthesis. Furthermore, we show that these spaces can generalize to novel instruments and can generate any path between instruments to understand their timbre relationships. As these spaces are continuous, we study how audio descriptors behave along the latent dimensions. We show that even though descriptors have an overall non-linear topology, they follow a locally smooth evolution. Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure., Comment: Digital Audio Effects Conference (DAFx 2018)
- Published
- 2018
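The perceptual regularization described above can be pictured as a loss term pushing pairwise latent distances towards the dissimilarity ratings of classical timbre studies. The snippet is a generic sketch of such a term with a made-up rating matrix, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def perceptual_regularizer(z, dissimilarity):
    """z: (n_instruments, latent_dim) latent positions; dissimilarity: (n, n) ratings."""
    diff = z.unsqueeze(0) - z.unsqueeze(1)
    latent_dist = (diff.pow(2).sum(-1) + 1e-8).sqrt()      # pairwise latent distances
    # Normalize both matrices so only the relative distance structure is matched.
    latent_dist = latent_dist / latent_dist.max()
    target = dissimilarity / dissimilarity.max()
    return F.mse_loss(latent_dist, target)

# Hypothetical 5-instrument example with a symmetric, made-up rating matrix.
ratings = torch.rand(5, 5)
ratings = (ratings + ratings.t()) / 2
ratings.fill_diagonal_(0)
z = torch.randn(5, 16, requires_grad=True)
perceptual_regularizer(z, ratings).backward()
```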
29. Live Orchestral Piano, a system for real-time orchestral music generation
- Author
-
Crestel, Léopold and Esling, Philippe
- Subjects
Computer Science - Learning - Abstract
This paper introduces the first system for performing automatic orchestration based on real-time piano input. We believe that it is possible to learn the underlying regularities existing between piano scores and their orchestrations by renowned composers, in order to automatically perform this task on novel piano inputs. To that end, we investigate a class of statistical inference models called conditional Restricted Boltzmann Machines (cRBM). We introduce a specific evaluation framework for orchestral generation based on a prediction task in order to assess the quality of different models. As prediction and creation are two widely different endeavours, we discuss the potential biases in evaluating temporal generative models through prediction tasks and their impact on a creative system. Finally, we introduce an implementation of the proposed model called Live Orchestral Piano (LOP), which performs real-time projective orchestration of a MIDI keyboard input.
- Published
- 2016
30. Continuous Descriptor-Based Control for Deep Audio Synthesis
- Author
-
Devis, Ninon, Demerlé, Nils, Nabi, Sarah, Genova, David, and Esling, Philippe
- Published
- 2023
- Full Text
- View/download PDF
31. Is Quality Enough? Integrating Energy Consumption in a Large-Scale Evaluation of Neural Audio Synthesis Models
- Author
-
Douwes, Constance, Bindi, Giovanni, Caillon, Antoine, Esling, Philippe, and Briot, Jean-Pierre
- Published
- 2023
- Full Text
- View/download PDF
32. Patchiness of deep-sea benthic Foraminifera across the Southern Ocean: Insights from high-throughput DNA sequencing
- Author
-
Lejzerowicz, Franck, Esling, Philippe, and Pawlowski, Jan
- Published
- 2014
- Full Text
- View/download PDF
33. Dynamic Musical Orchestration Using Genetic Algorithms and a Spectro-Temporal Description of Musical Instruments
- Author
-
Esling, Philippe, Carpentier, Grégoire, and Agon, Carlos
- Published
- 2010
- Full Text
- View/download PDF
34. Benthic monitoring of salmon farms in Norway using foraminiferal metabarcoding
- Author
-
Pawlowski, Jan, Esling, Philippe, Lejzerowicz, Franck, Cordier, Tristan, Visco, Joana A., Martins, Catarina I. M., Kvalvik, Arne, Staven, Knut, and Cedhagen, Tomas
- Published
- 2015
35. Ultra-deep sequencing of foraminiferal microbarcodes unveils hidden richness of early monothalamous lineages in deep-sea sediments
- Author
-
Lecroq, Béatrice, Lejzerowicz, Franck, Bachar, Dipankar, Christen, Richard, Esling, Philippe, Baerlocher, Loïc, Østerås, Magne, Farinelli, Laurent, and Pawlowski, Jan
- Published
- 2011
36. HEAR 2021: Holistic Evaluation of Audio Representations
- Author
-
Turian, Joseph, Shier, Jordie, Khan, Humair Raj, Raj, Bhiksha, Schuller, Björn W., Steinmetz, Christian J., Malloy, Colin, Tzanetakis, George, Velarde, Gissel, McNally, Kirk, Henry, Max, Pinto, Nicolas, Noufi, Camille, Clough, Christian, Herremans, Dorien, Fonseca, Eduardo, Engel, Jesse, Salamon, Justin, Esling, Philippe, Manocha, Pranay, Watanabe, Shinji, Jin, Zeyu, and Bisk, Yonatan
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Artificial Intelligence (cs.AI) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear., to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track
- Published
- 2022
- Full Text
- View/download PDF
37. Challenges in creative generative models for music: a divergence maximization perspective
- Author
-
Chemla–Romeu-Santos, Axel and Esling, Philippe
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Applications (stat.AP) ,Statistics - Applications ,Machine Learning (cs.LG) - Abstract
The development of generative Machine Learning (ML) models in creative practices, enabled by the recent improvements in usability and availability of pre-trained models, is attracting growing interest among artists, practitioners and performers. Yet, the introduction of such techniques in artistic domains has also revealed multiple limitations that escape current evaluation methods used by scientists. Notably, most models are still unable to generate content that lies outside the domain defined by the training dataset. In this paper, we propose an alternative prospective framework, starting from a new general formulation of ML objectives, which we derive to delineate possible implications and solutions that already exist in the ML literature (notably for the audio and musical domain). We also discuss existing relations between generative models and computational creativity and how our framework could help address the lack of creativity in existing models., Comment: to be published in AI Music Creativity Conference proceedings (AIMC2022)
- Published
- 2022
- Full Text
- View/download PDF
38. Accurate multiplexing and filtering for high-throughput amplicon-sequencing
- Author
-
Esling, Philippe, Lejzerowicz, Franck, and Pawlowski, Jan
- Published
- 2015
- Full Text
- View/download PDF
39. A multi-objective approach for sustainable generative audio models
- Author
-
Douwes, Constance, Esling, Philippe, and Briot, Jean-Pierre
- Subjects
[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,[INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE] ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] - Abstract
In recent years, the deep learning community has largely focused on the accuracy of deep generative models, resulting in impressive improvements in several research fields. However, this scientific race for quality comes at a tremendous computational cost, which incurs vast energy consumption and greenhouse gas emissions. If the current exponential growth of computational consumption persists, Artificial Intelligence (AI) will sadly become a considerable contributor to global warming. At the heart of this problem are the measures that we use as a scientific community to evaluate our work. Currently, researchers in the field of AI judge scientific works mostly based on the improvement in accuracy, log-likelihood, reconstruction or opinion scores, all of which entirely obliterates the actual computational cost of generative models. In this paper, we introduce the idea of relying on a multi-objective measure based on Pareto optimality, which simultaneously integrates the models' accuracy as well as the environmental impact of their training. By applying this measure on the current state-of-the-art in generative audio models, we show that this measure drastically changes the perceived significance of the results in the field, encouraging optimal training techniques and resource allocation. We hope that this type of measure will be widely adopted, in order to help the community to better evaluate the significance of their work, while bringing computational cost, and ultimately carbon emissions, into the spotlight of AI research.
- Published
- 2021
40. Combining Real-Time Extraction and Prediction of Musical Chord Progressions for Creative Applications
- Author
-
Carsault, Tristan, Nika, Jérôme, Esling, Philippe, and Assayag, Gérard
- Published
- 2021
- Full Text
- View/download PDF
41. The ACIDS Research project.
- Author
-
Bindi, Giovanni, Caillon, Antoine, Chemla-Romeu-Santos, Axel, Demerlé, Nils, Devis, Ninon, Douwes, Constance, Esling, Philippe, Genova, David, and Nabi, Sarah
- Subjects
DEEP learning ,ARTIFICIAL neural networks ,MACHINE learning ,SCIENTIFIC method ,GENERATIVE artificial intelligence ,DIGITAL audio - Abstract
The article explores the transformative impact of generative models, particularly in music creation, shedding light on the promises, dangers, and opportunities they present. Topics include the historical relationship between music and technology, the potential of AI-powered generative models in reshaping the creative process, and concerns regarding intellectual property, societal implications, and environmental sustainability.
- Published
- 2023
42. Représentations variationnelles inversibles : une nouvelle approche pour la synthèse sonore
- Author
-
Chemla-Romeu-Santos, Axel, Esling, Philippe, Ntalampiras, Stavros, and Institut de Recherche et Coordination Acoustique/Musique (IRCAM)
- Subjects
[SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts ,[INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD] - Abstract
International audience; In this article, we propose a new sound synthesis method based on variational auto-encoding techniques, which simultaneously infer an invertible representation of a dataset, referred to here as a generative space, and generate from the structural properties thus extracted. These recent methods, based on the extraction of low-dimensional spaces through the joint use of neural networks and Bayesian inference, allow not only great architectural flexibility but also the extraction of high-level generative spaces that can be explored directly or indirectly by the user. Nevertheless, the joint choice of architecture and training data decisively conditions the emergent properties of the extracted generative spaces, whose organization remains poorly understood. To address this, we propose an experimental approach to these systems through the development of a library, vschaos, aimed at building a bijective relationship between the evaluation of these models and their use in creative environments. (A minimal variational auto-encoder sketch follows this entry.)
- Published
- 2020
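The following is a minimal variational auto-encoder sketch in PyTorch over magnitude-spectrum frames, intended only to illustrate the kind of generative space the abstract refers to; the layer sizes, input dimensionality and overall architecture are assumptions and do not reflect the actual vschaos library.

    # Minimal VAE sketch over spectral frames (PyTorch). Assumes 512-bin
    # magnitude spectra; illustrative only, not the vschaos implementation.
    import torch
    import torch.nn as nn

    class SpectralVAE(nn.Module):
        def __init__(self, n_bins=512, latent_dim=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
            self.to_mu = nn.Linear(256, latent_dim)
            self.to_logvar = nn.Linear(256, latent_dim)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_bins)
            )

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Reparameterization trick: sample z while keeping gradients.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            recon = self.decoder(z)
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return recon, kl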
43. Attributes-aware deep music transformation
- Author
-
Kawai, Lisa, Esling, Philippe, and Harada, Tatsuya
- Abstract
Recent machine learning techniques have enabled a large variety of novel music generation processes. However, most approaches do not provide any form of interpretable control over musical attributes, such as pitch and rhythm. Obtaining control over the generation process is critically important for its use in real-life creative setups. Nevertheless, this problem remains arduous, as there are no known functions or differentiable approximations for transforming symbolic music while controlling its musical attributes. In this work, we propose a novel method that enables attributes-aware music transformation from any set of musical annotations, without requiring complicated derivative implementations. By relying on an adversarial confusion criterion on the given musical annotations, we force the latent space of a generative model to abstract away these features. Then, by reintroducing these features as conditioning to the generative function, we obtain continuous control over them. To demonstrate our approach, we rely on sets of musical attributes computed by the jSymbolic library as annotations and conduct experiments showing that our method outperforms previous methods in terms of control. Finally, comparing correlations between attributes and the transformed results shows that our method provides explicit control over any continuous or discrete annotation. (A hedged sketch of the adversarial confusion criterion follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
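The sketch below illustrates one way an adversarial confusion criterion of the kind described above can be set up in PyTorch: a small classifier tries to recover an attribute from the latent code, the encoder is penalized when it succeeds, and the attribute is then reintroduced as conditioning for the decoder. All layer sizes, the attribute encoding and the loss combination are hypothetical and not the paper's architecture.

    # Adversarial confusion sketch: separate optimizers for the classifier
    # and the encoder/decoder are assumed; sizes are illustrative only.
    import torch
    import torch.nn as nn

    latent_dim, n_attr_classes = 32, 8
    encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    decoder = nn.Sequential(nn.Linear(latent_dim + n_attr_classes, 64), nn.ReLU(),
                            nn.Linear(64, 128))
    attr_classifier = nn.Linear(latent_dim, n_attr_classes)
    ce = nn.CrossEntropyLoss()

    def training_losses(x, attr_onehot, attr_label):
        z = encoder(x)
        # Classifier term: learn to predict the attribute from a detached latent.
        classifier_loss = ce(attr_classifier(z.detach()), attr_label)
        # Confusion term: the encoder is rewarded when the classifier fails.
        confusion = -ce(attr_classifier(z), attr_label)
        # Conditioning: the attribute is concatenated back for the decoder.
        recon = decoder(torch.cat([z, attr_onehot], dim=-1))
        reconstruction = nn.functional.mse_loss(recon, x)
        return reconstruction + confusion, classifier_loss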
44. Dynamic Musical Orchestration Using Genetic Algorithms and a Spectro-Temporal Description of Musical Instruments
- Author
-
Esling, Philippe, primary, Carpentier, Grégoire, additional, and Agon, Carlos, additional
- Published
- 2010
- Full Text
- View/download PDF
45. Dynamic Computer-Aided Orchestration in Practice with Orchidea
- Author
-
Cella, Carmine-Emanuele, primary, Ghisi, Daniele, additional, Maresz, Yan, additional, Petrolati, Alessandro, additional, Teiller, Alexandre, additional, and Esling, Philippe, additional
- Published
- 2021
- Full Text
- View/download PDF
46. FlowSynth: Simplifying Complex Audio Generation Through Explorable Latent Spaces with Normalizing Flows
- Author
-
Esling, Philippe, primary, Masuda, Naotake, additional, and Chemla-Romeu-Santos, Axel, additional
- Published
- 2020
- Full Text
- View/download PDF
47. FOLDED CQT RCNN FOR REAL-TIME RECOGNITION OF INSTRUMENT PLAYING TECHNIQUES
- Author
-
DUCHER, Jean-Francois, Esling, Philippe, Centre de recherche Informatique et Création Musicale (CICM), Esthétiques, musicologie, danse et créations musicales (MUSIDANSE), Université Paris 8 Vincennes-Saint-Denis (UP8)-Université Paris 8 Vincennes-Saint-Denis (UP8), Université Paris 8 Vincennes-Saint-Denis (UP8), Représentations musicales (Repmus), Sciences et Technologies de la Musique et du Son (STMS), IRCAM-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-IRCAM-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Esthétique, musicologie, danse et création musicale (MUSIDANSE), and Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts ,[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,[INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD] ,[INFO.INFO-MM]Computer Science [cs]/Multimedia [cs.MM] ,[INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE] ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] - Abstract
International audience; In the past years, deep learning has produced state-of-the-art performance in timbre and instrument classification. However, only a few models currently deal with the recognition of advanced Instrument Playing Techniques (IPT), and none of them takes a real-time approach to this problem. Furthermore, most studies rely on a single sound bank for training and testing, so their methodology provides no assurance that their results generalize to other sounds. In this article, we extend state-of-the-art convolutional neural networks to the classification of IPTs. We build the first IPT corpus drawn from independent sound banks, annotate it with the JAMS standard and make it freely available. Our models yield consistently high accuracies on a homogeneous subset of this corpus. However, only a proper taxonomy of IPTs and specifically defined input transforms offer adequate resilience under the "minus-1db" methodology, which assesses the ability of the models to generalize. In particular, we introduce a novel Folded Constant-Q Transform adjusted to the requirements of IPT classification (one plausible reading is sketched after this entry). Finally, we discuss the use of our classifier in real time.
- Published
- 2019
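The exact definition of the Folded Constant-Q Transform is not given in this record, so the following NumPy sketch shows only one plausible reading: a precomputed CQT magnitude matrix is reshaped so that bins occupying the same position within each octave are stacked along a new axis, aligning pitch-class content across octaves. Bin counts and shapes are illustrative assumptions.

    # One possible octave folding of a CQT magnitude matrix (assumption,
    # not the paper's definition). Input shape: (n_bins, n_frames).
    import numpy as np

    def fold_cqt(cqt_mag, bins_per_octave=24):
        """Stack octaves of a CQT along a new leading axis."""
        n_bins, n_frames = cqt_mag.shape
        n_octaves = n_bins // bins_per_octave
        trimmed = cqt_mag[: n_octaves * bins_per_octave]
        # Resulting shape: (n_octaves, bins_per_octave, n_frames).
        return trimmed.reshape(n_octaves, bins_per_octave, n_frames)

    cqt = np.abs(np.random.randn(144, 200))  # placeholder for a real CQT
    folded = fold_cqt(cqt)
    print(folded.shape)  # (6, 24, 200)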
48. Cross-Modal Variational Inference For Bijective Signal-Symbol Translation
- Author
-
Chemla-Romeu-Santos, Axel, Ntalampiras, Stavros, Esling, Philippe, Haus, Goffredo, Assayag, Gérard, Représentations musicales (Repmus), Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Laboratorio d'Informatica Musicale (LIM), and Università degli Studi di Milano [Milano] (UNIMI)
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,[INFO]Computer Science [cs] ,Machine Learning (cs.LG) ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] - Abstract
Extraction of symbolic information from signals is an active field of research enabling numerous applications, especially in the Music Information Retrieval domain. This complex task, which is also related to other topics such as pitch extraction or instrument recognition, is a demanding subject that has given birth to numerous approaches, mostly based on advanced signal-processing algorithms. However, these techniques are often non-generic: they can extract definite physical properties of the signal (pitch, octave), but do not allow arbitrary vocabularies or more general annotations. On top of that, these techniques are one-sided, meaning that they can extract symbolic data from an audio signal, but cannot perform the reverse process of symbol-to-signal generation. In this paper, we propose a bijective approach for signal/symbol translation by turning this problem into a density estimation task over the signal and symbolic domains, considered as related random variables. We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match through an additive constraint, allowing both models to learn and generate separately while enabling signal-to-symbol and symbol-to-signal inference (a rough sketch follows this entry). We test our models on pitch, octave and dynamics symbols, which constitute a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather light during training and generation, while allowing several interesting creative uses that we outline at the end of the article., Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-2019)
- Published
- 2019
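As a rough illustration of the two-auto-encoder setup with an additive latent matching constraint, the PyTorch sketch below ties the latent codes of a signal-domain and a symbol-domain VAE with a mean-squared penalty. Input dimensionalities, layer sizes and loss weights are placeholders, not the architecture used in the paper.

    # Two VAEs whose latents are forced to match (sketch, placeholder sizes).
    import torch
    import torch.nn as nn

    def make_vae(in_dim, latent_dim=16):
        enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                            nn.Linear(64, 2 * latent_dim))
        dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                            nn.Linear(64, in_dim))
        return enc, dec

    signal_enc, signal_dec = make_vae(in_dim=512)   # e.g. spectral frames
    symbol_enc, symbol_dec = make_vae(in_dim=128)   # e.g. pitch/octave/dynamics codes

    def step(signal, symbol, weight=1.0):
        def encode(enc, x):
            mu, logvar = enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return z, kl
        z_sig, kl_sig = encode(signal_enc, signal)
        z_sym, kl_sym = encode(symbol_enc, symbol)
        recon = nn.functional.mse_loss(signal_dec(z_sig), signal) \
              + nn.functional.mse_loss(symbol_dec(z_sym), symbol)
        match = nn.functional.mse_loss(z_sig, z_sym)  # additive matching term
        return recon + kl_sig + kl_sym + weight * match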
49. APPRENTISSAGE PROFOND POUR LA RECONNAISSANCE EN TEMPS REEL DES MODES DE JEU INSTRUMENTAUX
- Author
-
DUCHER, Jean-Francois, Esling, Philippe, Centre de recherche Informatique et Création Musicale (CICM), Esthétiques, musicologie, danse et créations musicales (MUSIDANSE), Université Paris 8 Vincennes-Saint-Denis (UP8)-Université Paris 8 Vincennes-Saint-Denis (UP8), Université Paris 8 Vincennes-Saint-Denis (UP8), Représentations musicales (Repmus), Sciences et Technologies de la Musique et du Son (STMS), IRCAM-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-IRCAM-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Esthétique, musicologie, danse et création musicale (MUSIDANSE), and Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
[SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts ,[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,[INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD] ,[INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE] ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] - Abstract
International audience; In recent years, deep learning has established itself as the new reference method for audio classification problems, in particular instrument recognition. However, these models generally do not address the classification of advanced playing techniques, a question that is nonetheless central to contemporary composition. The few existing studies are limited to evaluation on a single sound bank, which gives no guarantee of generalization to real data. In this article, we extend state-of-the-art methods to the real-time classification of instrumental playing techniques from recordings of soloists. We show that a combination of convolutional (CNN) and recurrent (RNN) networks obtains excellent results on a homogeneous corpus drawn from 5 sound banks. However, their performance drops noticeably on a heterogeneous corpus, which could indicate a limited ability to generalize to real data. We suggest directions for addressing this problem. Finally, we detail several possible uses of our models within interactive systems. (A small convolutional-recurrent sketch follows this entry.)
- Published
- 2019
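The following PyTorch sketch shows a small convolutional-recurrent (CRNN) classifier of the general kind mentioned in the abstract, operating on a time-frequency input and emitting playing-technique logits; the number of bins, channels and classes are illustrative assumptions rather than the authors' configuration.

    # Minimal CRNN sketch for playing-technique classification (assumed sizes).
    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        def __init__(self, n_bins=96, n_classes=12):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d((2, 1)),              # pool over frequency only
            )
            self.gru = nn.GRU(16 * (n_bins // 2), 64, batch_first=True)
            self.head = nn.Linear(64, n_classes)

        def forward(self, spec):                   # spec: (batch, 1, n_bins, n_frames)
            h = self.conv(spec)                    # (batch, 16, n_bins // 2, n_frames)
            h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, n_frames, features)
            out, _ = self.gru(h)
            return self.head(out[:, -1])           # class logits from the last frame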
50. Using musical relationships between chord labels in automatic chord extraction tasks
- Author
-
Carsault, Tristan, Nika, Jérôme, Esling, Philippe, Représentations musicales (Repmus), Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique, Image et Interaction - EA 2118 (L3I), Université de La Rochelle (ULR), ANR-14-CE24-0002,DYCI2,Dynamiques créatives de l'interaction improvisée(2014), ANR-17-CE38-0015,MAKIMOno,Analyse multivariée et inférence de savoir pour l'orchestration musicale(2017), Nika, Jérôme, Appel à projets générique - Dynamiques créatives de l'interaction improvisée - - DYCI22014 - ANR-14-CE24-0002 - Appel à projets générique - VALID, and Analyse multivariée et inférence de savoir pour l'orchestration musicale - - MAKIMOno2017 - ANR-17-CE38-0015 - AAPG2017 - VALID
- Subjects
FOS: Computer and information sciences ,[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI] ,Sound (cs.SD) ,Computer Science - Machine Learning ,[INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing ,Music information retrieval ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,[INFO.INFO-OH]Computer Science [cs]/Other [cs.OH] ,[INFO.INFO-NE] Computer Science [cs]/Neural and Evolutionary Computing [cs.NE] ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,[INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE] ,Computer Science - Sound ,Machine Learning (cs.LG) ,Computer Science - Information Retrieval ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,[SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing ,[INFO.INFO-MM] Computer Science [cs]/Multimedia [cs.MM] ,[SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts ,[INFO.INFO-MM]Computer Science [cs]/Multimedia [cs.MM] ,Deep learning ,[INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG] ,[INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD] ,[INFO.INFO-OH] Computer Science [cs]/Other [cs.OH] ,[SHS.MUSIQ] Humanities and Social Sciences/Musicology and performing arts ,[INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD] ,Automatic chord extraction ,[SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing ,Information Retrieval (cs.IR) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Recent research on Automatic Chord Extraction (ACE) has focused on improving models based on machine learning. However, most models still fail to take into account the prior knowledge underlying the labeling alphabets (chord labels). Furthermore, recent works have shown that ACE performance is converging towards a glass ceiling. This prompts the need to focus on other aspects of the task, such as the introduction of musical knowledge in the representation, the improvement of the models towards more complex chord alphabets, and the development of more adapted evaluation methods. In this paper, we propose to exploit specific properties of, and relationships between, chord labels in order to improve the learning of statistical ACE models. Hence, we analyze the interdependence between the representations of chords and their associated distances, the precision of the chord alphabets, and the impact of reducing the alphabet before or after training the model. Furthermore, we propose new training losses based on music theory (one possible form is sketched after this entry). We show that these improve the results of ACE systems based on Convolutional Neural Networks. By performing an in-depth analysis of our results, we uncover a set of related insights on ACE tasks based on statistical models, and also formalize the musical meaning of some classification errors., Accepted for publication in ISMIR, 2018
- Published
- 2018
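The paper's training losses are not spelled out in this record; the PyTorch sketch below shows one generic way to build a musically-informed loss by weighting the predicted chord distribution with a distance matrix between chord labels, so that musically close confusions are penalized less. The distance matrix here is a random placeholder, not the paper's definition.

    # Distance-weighted chord classification loss (sketch with placeholder data).
    import torch
    import torch.nn.functional as F

    n_chords = 25                                    # e.g. 12 maj + 12 min + no-chord
    chord_distance = torch.rand(n_chords, n_chords)  # placeholder musical distances
    chord_distance.fill_diagonal_(0.0)               # correct label costs nothing

    def distance_weighted_loss(logits, target):
        """Expected chord distance under the predicted distribution."""
        probs = F.softmax(logits, dim=-1)            # (batch, n_chords)
        per_class_cost = chord_distance[target]      # (batch, n_chords)
        return (probs * per_class_cost).sum(dim=-1).mean()

    logits = torch.randn(8, n_chords)
    target = torch.randint(0, n_chords, (8,))
    print(distance_weighted_loss(logits, target))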