Descriptor: "[INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"[INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]"' showing total 2,072 results

1. Learning a Riemannian manifold for the analysis-synthesis of nonstationary sounds

Author: Han, Han, Lostanlen, Vincent, Lagrange, Mathieu, Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), Signal, IMage et Son (LS2N - équipe SIMS ), and Nantes Université (Nantes Univ)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique)
Subjects: [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]
Abstract: International audience; Computer sound matching poses an inverse problem, namely, the identification of resynthesis parameters. Borrowing from the differentiable digital signal processing (DDSP) framework, we propose to automate its resolution by training a deep neural network. In this context, we aim to reach a compromise between the computational efficiency of parametric loss (P-loss) and the psychoacoustical fidelity of spectral loss. Our approach, named "perceptual-neural-physical" (PNP), estimates the Riemannian metric associated with the composition of parametric synthesis and joint time-frequency scattering (JTFS). By doing so, we locally linearize spectral loss and accelerate convergence. Furthermore, resorting to Tikhonov regularization improves the conditioning of the inverse problem. On an analysis-synthesis task for musical arpeggios, PNP training outperforms the state-of-the-art methods P-loss (wav2shape) and STFT-based DDSP, as measured in terms of JTFS-based similarity between the reference signal and the reconstructed signal.
Published: 2023

2. How to (Virtually) Train Your Speaker Localizer

Author: Srivastava, Prerak, Deleforge, Antoine, Politis, Archontis, Vincent, Emmanuel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Institut National de Recherche en Informatique et en Automatique (Inria), and University of Tampere [Finland]
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Sound, localization, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], image source, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, room acoustic simulation, directivity, direction-of-arrival, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Learning-based methods have become ubiquitous in speaker localization. Existing systems rely on simulated training sets for lack of sufficiently large, diverse and annotated real datasets. Most room acoustics simulators used for this purpose rely on the image source method (ISM) because of its computational efficiency. This paper argues that carefully extending the ISM to incorporate more realistic surface, source and microphone responses into training sets can significantly boost the real-world performance of speaker localization systems. It is shown that increasing the training-set realism of a state-of-the-art direction-of-arrival estimator yields consistent improvements across three different real test sets featuring human speakers in a variety of rooms and various microphone arrays. An ablation study further reveals that every added layer of realism contributes positively to these improvements. Comment: Published in INTERSPEECH 2023
Published: 2023

3. Influence of context recognition on the representation of acoustic horizon: investigations on auditory distance perception

Author: Fleurence, Pierre, Aramaki, Mitsuko, Kronland-Martinet, Richard, Perception, Représentations, Image, Son, Musique (PRISM), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), GMEM - Centre National de Création Musicale (GMEM), Centre National de Création Musicale, Balint, Jamilla, and Fels, Janina
Subjects: [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], acoustic horizon, soundscape, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SCCO.PSYC]Cognitive science/Psychology, auditory distance perception
Abstract: International audience; Distance perception in audio content has long been investigated in terms of sources’ distance, that is, how far a source is from the perspective of a listener. Yet a spatial audio scene is not only perceived as a sum of point-like audio entities; it also involves broader aspects, based for example on shape or texture. Moreover, in the domain of soundscape studies, researchers describe a soundscape as a sonic environment with emphasis on the way it is perceived and understood by an individual. This implies an interaction between the recognition of the sonic environment and its perception. Based on this framework, how is auditory distance perception, defined as the acoustic horizon, influenced by the recognition of the environment? To answer this question, we designed an experiment which showed that there is an interaction between the recognition of the environment and the perception of the acoustic horizon. In addition, we discuss how this interaction could be linked to multiple factors. We will present our results and the leads for a better perceptual characterization of the acoustic horizon.
Published: 2023

4. Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

Author: Lebourdais, Martin, Mariotte, Théo, Tahon, Marie, Larcher, Anthony, Laurent, Antoine, Montresor, Silvio, Meignier, Sylvain, Thomas, Jean-Hugh, Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), Laboratoire d'Acoustique de l'Université du Mans (LAUM), Le Mans Université (UM)-Centre National de la Recherche Scientifique (CNRS), This work was performed using HPC resources from GENCI–IDRIS (Grant 2022-AD011012565), French ANR GEM (ANR-19-CE38-0012), LMAC grant from Région Pays de la Loire., Le Mans Université, ANR-19-CE38-0012,GEM,Mesure de l'égalité entre les sexes dans les médias(2019), and European Project: 101007666,Exchanges for SPEech ReseArch aNd TechnOlogies
Subjects: overlap speech detection, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Speech segmentation, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], speech activity detection, multi-channel, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Voice activity detection and overlapped speech detection (VAD and OSD, respectively) are key pre-processing tasks for speaker diarization. The final segmentation performance relies heavily on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, and thus lack information about the generalization capacities of the systems. This paper proposes a new and complete benchmark of different VAD and OSD models on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meeting...). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that the joint training of these two tasks offers similar performance in terms of F1-score to two dedicated VAD and OSD systems while reducing the training cost. This unique architecture can also be used for single and multichannel speech processing.
Published: 2023

5. Spatial Integration of Dynamic Auditory Feedback in Electric Vehicle Interior

Author: Dupré, Théophile, Aramaki, Mitsuko, Denjean, Sébastien, Kronland-Martinet, Richard, Perception, Représentations, Image, Son, Musique (PRISM), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), and Stellantis France
Subjects: [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; With the development of electric motor vehicles, the domain of automotive sound design addresses new issues, and is now concerned with creating suitable and pleasant soundscapes inside the vehicle. For instance, the absence of a predominant engine sound changes the driver's perception of the car's dynamics. Previous studies proposed relevant sonification strategies to augment the interior sound environment by bringing back vehicle dynamics with synthetic auditory cues. Yet, users report a lack of blending with the existing soundscape. In this study, we analyze acoustical and perceptual spatial characteristics of the car soundscape and show that the spatial attributes of sound sources are fundamental to improving the perceptual coherence of the global environment.
Published: 2023

6. Perceptual–Neural–Physical Sound Matching

Author: Han Han, Vincent Lostanlen, Mathieu Lagrange, Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), Signal, IMage et Son (LS2N - équipe SIMS ), Nantes Université (Nantes Univ)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), and Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)
Subjects: auditory similarity, deep convolutional networks, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], physical modeling synthesis, scattering transform, sound matching
Abstract: International audience; Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis. Deep neural networks have achieved promising results in matching sustained harmonic tones. However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion. We attribute this problem to the inadequacy of the loss function. On the one hand, mean square error in the parametric domain, known as "P-loss", is simple and fast but fails to accommodate the differing perceptual significance of each parameter. On the other hand, mean square error in the spectrotemporal domain, known as "spectral loss", is perceptually motivated and serves in differentiable digital signal processing (DDSP). Yet, spectral loss is a poor predictor of pitch intervals and its gradient may be computationally expensive, hence slow convergence. Against this conundrum, we present Perceptual-Neural-Physical loss (PNP). PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training. We instantiate PNP with physical modeling synthesis as decoder and joint time-frequency scattering transform (JTFS) as spectral representation. We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.
Published: 2023

7. Leveraging Sparsity with Spiking Recurrent Neural Networks for Energy-Efficient Keyword Spotting

Author: Dampfhoffer, Manon, Mesquida, Thomas, Hardy, Emmanuel, Valentian, Alexandre, Anghel, Lorena, SPINtronique et TEchnologie des Composants (SPINTEC), Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche Interdisciplinaire de Grenoble (IRIG), Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Grenoble Alpes (UGA), Département Systèmes et Circuits Intégrés Numériques (DSCIN), Laboratoire d'Intégration des Systèmes et des Technologies (LIST (CEA)), Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA), Laboratoire d'électronique et des technologies de l'Information [Sfax] (LETI), and École Nationale d'Ingénieurs de Sfax | National School of Engineers of Sfax (ENIS)
Subjects: Spiking neural networks, speech commands, energy-efficiency, sparsity, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], keyword spotting, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; Bio-inspired Spiking Neural Networks (SNNs) are promising candidates to replace standard Artificial Neural Networks (ANNs) for energy-efficient keyword spotting (KWS) systems. In this work, we compare the trade-off between accuracy and energy-efficiency of a gated recurrent SNN (SpikGRU) with that of a standard Gated Recurrent Unit (GRU) on the Google Speech Command Dataset (GSCD) v2. We show that, by taking advantage of the sparse spiking activity of the SNN, both accuracy and energy-efficiency can be increased. Leveraging data sparsity by using spiking inputs, such as those produced by spiking audio feature extractors or dynamic sensors, can further improve energy-efficiency. We demonstrate state-of-the-art results for SNNs on GSCD v2 with up to 95.9% accuracy. Moreover, SpikGRU can achieve accuracy similar to that of the GRU while reducing the number of operations by up to 82%.
Published: 2023

8. Explainable Audio Classification of Playing Techniques with Layer-wise Relevance Propagation

Author: Changhong Wang, Vincent Lostanlen, Mathieu Lagrange, Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Télécom Paris, Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), and Supported by an Atlanstic2020 project on Trainable Acoustic Sensors (TrAcS)
Subjects: playing technique recognition, music signal analysis, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Layer-wise relevance propagation, scattering transform, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; Deep convolutional networks (convnets) in the time-frequency domain can learn an accurate and fine-grained categorization of sounds. For example, in the context of music signal analysis, this categorization may correspond to a taxonomy of playing techniques: vibrato, tremolo, trill, and so forth. However, convnets lack an explicit connection with the neurophysiological underpinnings of musical timbre perception. In this article, we propose a data-driven approach to explain audio classification in terms of physical attributes in sound production. We borrow from current literature in "explainable AI" (XAI) to study the predictions of a convnet which achieves an almost perfect score on a challenging task, namely the classification of five comparable real-world playing techniques from 30 instruments spanning seven octaves. Mapping the signal into the carrier-modulation domain using the scattering transform, we decompose the network's predictions over this domain with layer-wise relevance propagation. We find that the regions highly relevant to the predictions are localized around the physical attributes with which the playing techniques are performed.
Published: 2023

9. Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction

Author: Dhaussy, Timothée, Jabaian, Bassam, Lefèvre, Fabrice, Horaud, Radu, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA), IEEE Signal Processing Society, and ANR-20-CE33-0008,muDialBot,MUlti-party perceptually-active situated DIALog for human-roBOT interaction(2020)
Subjects: [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], multimodal, [INFO.INFO-RB]Computer Science [cs]/Robotics [cs.RO], speaker diarization, multimodal human-robot interaction, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: International audience; The speaker diarization task answers the question "who is speaking at a given time?". It provides valuable information for scene analysis in a domain such as robotics. In this paper, we introduce a temporal audiovisual fusion model for multi-user speaker diarization, with low computing requirements, good robustness and no training phase. The proposed method identifies the dominant speakers and tracks them over time by measuring the spatial coincidence between sound locations and visual presence. The model is generative, its parameters are estimated online, and it does not require training. Its effectiveness was assessed using two datasets, a public one and one collected in-house with the Pepper humanoid robot.
Published: 2023

10. 'Prediction of Sleepiness Ratings from Voice by Man and Machine': A Perceptual Experiment Replication Study

Author: P. Martin, Vincent, Ferron, Aymeric, Rouas, Jean-Luc, Philip, Pierre, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Sommeil, Addiction et Neuropsychiatrie [Bordeaux] (SANPSY), Université de Bordeaux (UB)-CHU de Bordeaux Pellegrin [Bordeaux]-Centre National de la Recherche Scientifique (CNRS), Université de Bordeaux (UB), Popular interaction with 3d content (Potioc), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), CHU Bordeaux [Bordeaux], and ANR-10-LABX-0043,BRAIN,Bordeaux Region Aquitaine Initiative for Neuroscience(2010)
Subjects: Experimental study replication, Paralinguistic, Sleepiness, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Voice, Perceptual study, [SCCO.LING]Cognitive science/Linguistics
Abstract: International audience; Following the release of the SLEEP corpus during the Interspeech 2019 paralinguistic continuous sleepiness estimation challenge, a paper presented at Interspeech 2020 by Huckvale et al. examined the reasons for the poor performance of the models proposed for this task. Careful analyses of the corpus led to the conclusion that its bias makes it hazardous to use for training machine learning systems, but a perceptual experiment on a subset of this corpus seemed to indicate that human listeners are nevertheless able to estimate sleepiness on this corpus. In this study, we present the results of the Endymion replication study, in which the same samples were rated by thirty French-speaking naive listeners. We then discuss the causes of the differences between the two studies and examine the effect of listener and sample characteristics on annotation performance.
Published: 2023

11. Speech Modeling with a Hierarchical Transformer Dynamical VAE

Author: Lin, Xiaoyu, Bie, Xiaoyu, Leglaive, Simon, Girin, Laurent, Alameda-Pineda, Xavier, Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA), Institut d'Électronique et des Technologies du numéRique (IETR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA), ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019), ANR-19-CE33-0008,ML3RI,Apprentissage de bas-niveau d'ineractions robotiques multi-modales avec plusieurs personnes(2019), and European Project: 871245,H2020-EU.2.1.1. - INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT),SPRING(2020)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; The dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a corresponding sequence of latent vectors. In almost all the DVAEs of the literature, the temporal dependencies within each sequence and across the two sequences are modeled with recurrent neural networks. In this paper, we propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE), which is a DVAE with two levels of latent variable (sequence-wise and frame-wise) and in which the temporal dependencies are implemented with the Transformer architecture. We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling, while enabling a simpler training procedure, revealing its high potential for downstream low-level speech processing tasks such as speech enhancement.
Published: 2023

12. Detecting and reducing heterogeneity of error in acoustic classification

Author: Oliver C. Metcalf, Jos Barlow, Yves Bas, Erika Berenguer, Christian Devenish, Filipe França, Stuart Marsden, Charlotte Smith, Alexander C. Lees, Manchester Metropolitan University (MMU), Lancaster University, Centre d'Ecologie et des Sciences de la COnservation (CESCO), Muséum national d'Histoire naturelle (MNHN)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Centre d’Ecologie Fonctionnelle et Evolutive (CEFE), Université Paul-Valéry - Montpellier 3 (UPVM)-École Pratique des Hautes Études (EPHE), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche pour le Développement (IRD [France-Sud])-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE)-Institut Agro Montpellier, Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Université de Montpellier (UM), University of Oxford, School of Biological Sciences [Bristol], and University of Bristol [Bristol]
Subjects: autonomous recording unit, bioacoustics, ecoacoustics, [SDE.IE]Environmental Sciences/Environmental Engineering, Ecological Modeling, machine-learning, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [PHYS.MECA.BIOM]Physics [physics]/Mechanics [physics]/Biomechanics [physics.med-ph], automated signal recognition, Ecology, Evolution, Behavior and Systematics, [PHYS.MECA.ACOU]Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph]
Abstract: Passive acoustic monitoring can be an effective method for monitoring species, allowing the assembly of large audio datasets, removing logistical constraints in data collection and reducing anthropogenic monitoring disturbances. However, the analysis of large acoustic datasets is challenging and fully automated machine learning processes are rarely developed or implemented in ecological field studies. One of the greatest uncertainties hindering the development of these methods is spatial generalisability: can an algorithm trained on data from one place be used elsewhere? We demonstrate that heterogeneity of error across space is a problem that could go undetected using common classification accuracy metrics. Second, we develop a method to assess the extent of heterogeneity of error in a random forest classification model for six Amazonian bird species. Finally, we propose two complementary ways to reduce heterogeneity of error, by (i) accounting for it in the thresholding process and (ii) using a secondary classifier that uses contextual data. We found that using a thresholding approach that accounted for heterogeneity of precision error reduced the coefficient of variation of the precision score from a mean of 0.61 ± 0.17 (SD) to 0.41 ± 0.25 in comparison to the initial classification with threshold selection based on F-score. The use of a secondary, contextual classification with thresholding selection accounting for heterogeneity of precision reduced it further still, to 0.16 ± 0.13, and was significantly lower than the initial classification in all but one species. Mean average precision scores increased, from 0.66 ± 0.4 for the initial classification, to 0.95 ± 0.19, a significant improvement for all species. We recommend assessing, and if necessary correcting for, heterogeneity of precision error when using automated classification on acoustic data to quantify species presence as a function of an environmental, spatial or temporal predictor variable.
Published: 2022

13. Motion Estimation by Deep Learning in 2D Echocardiography: Synthetic Dataset and Validation

Author: Ewan Evain, Yunyun Sun, Khuram Faraz, Damien Garcia, Eric Saloux, Bernhard L. Gerber, Mathieu De Craene, Olivier Bernard, Bernard, Olivier, Modeling & analysis for medical imaging and Diagnosis (MYRIAD), Centre de Recherche en Acquisition et Traitement de l'Image pour la Santé (CREATIS), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Hospices Civils de Lyon (HCL)-Université Jean Monnet - Saint-Étienne (UJM)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Hospices Civils de Lyon (HCL)-Université Jean Monnet - Saint-Étienne (UJM)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS), Imagerie Ultrasonore, Service de cardiologie et de pathologie vasculaire [CHU Caen], Université de Caen Normandie (UNICAEN), Normandie Université (NU)-Normandie Université (NU)-CHU Caen, Normandie Université (NU)-Tumorothèque de Caen Basse-Normandie (TCBN)-Tumorothèque de Caen Basse-Normandie (TCBN), Signalisation, électrophysiologie et imagerie des lésions d’ischémie-reperfusion myocardique (SEILIRM), Normandie Université (NU)-Normandie Université (NU), Université Catholique de Louvain = Catholic University of Louvain (UCL), MedisysResearch Lab (Medisys), Philips Research, UCL - SSS/IREC/CARD - Pôle de recherche cardiovasculaire, and UCL - (SLuc) Service de pathologie cardiovasculaire
Subjects: Radiological and Ultrasound Technology, [INFO.INFO-IM]Computer Science [cs]/Medical Imaging, Deep learning, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SDV.MHEP.CSC]Life Sciences [q-bio]/Human health and pathology/Cardiology and cardiovascular system, Computer Science Applications, Motion, Echocardiography, Ultrasound Imaging, Image Processing, Computer-Assisted, Humans, Electrical and Electronic Engineering, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Motion Estimation, Software
Abstract: International audience; Motion estimation in echocardiography plays an important role in the characterization of cardiac function, allowing the computation of myocardial deformation indices. However, there exist limitations in clinical practice, particularly with regard to the accuracy and robustness of measurements extracted from images. We therefore propose a novel deep learning solution for motion estimation in echocardiography. Our network corresponds to a modified version of PWC-Net which achieves high performance on ultrasound sequences. In parallel, we designed a novel simulation pipeline allowing the generation of a large amount of realistic B-mode sequences. These synthetic data, together with strategies during training and inference, were used to improve the performance of our deep learning solution, which achieved an average endpoint error of 0.07 ± 0.06 mm per frame and 1.20 ± 0.67 mm between ED and ES on our simulated dataset. The performance of our method was further investigated on 30 patients from a publicly available clinical dataset acquired from a GE system. The method showed promise by achieving a mean absolute error of the global longitudinal strain of 2.5 ± 2.1% and a correlation of 0.77 compared to GLS derived from manual segmentation, much better than one of the most efficient methods in the state-of-the-art (namely the FFT-Xcorr block-matching method). We finally evaluated our method on an auxiliary dataset including 30 patients from another center and acquired with a different system. Comparable results were achieved, illustrating the ability of our method to maintain high performance regardless of the echocardiographic data processed.
Published: 2022

14. Spatially Oriented Format for Acoustics 2.1: Introduction and Recent Advances

Author: Piotr Majdak, Franz Zotter, Fabian Brinkmann, Julien De Muynke, Michael Mihocic, Markus Noisternig, Austrian Academy of Sciences (OeAW), Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Audio Communication Group, Technical University of Berlin, Lutheries - Acoustique - Musique (IJLRDA-LAM), Institut Jean Le Rond d'Alembert (DALEMBERT), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Espaces acoustiques et cognitifs (EAC), Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), ANR-20-JPIC-0002,PHÉ,Les Oreilles du Passé(2020), ANR-18-CE38-0004,RASPUTIN,Room Acoustic Simulation for Improved Spatial Perceptual Understanding using Real-Time Immersive Navigation(2018), and European Project: 101017743,SONICOM
Subjects: [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], General Engineering, Music
Abstract: International audience
Published: 2022

15. Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS

Author: Ogun, Sewade, Colotte, Vincent, Vincent, Emmanuel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), and Grid'5000
Subjects: FOS: Computer and information sciences, Sound (cs.SD), pitch prediction, generative models, Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], text-to-speech, Computer Science - Sound, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Flow-based generative models are widely used in text-to-speech (TTS) systems to learn the distribution of audio features (e.g., Mel-spectrograms) given the input tokens and to sample from this distribution to generate diverse utterances. However, in the zero-shot multi-speaker TTS scenario, the generated utterances lack diversity and naturalness. In this paper, we propose to improve the diversity of utterances by explicitly learning the distribution of fundamental frequency sequences (pitch contours) of each speaker during training using a stochastic flow-based pitch predictor, then conditioning the model on generated pitch contours during inference. The experimental results demonstrate that the proposed method yields a significant improvement in the naturalness and diversity of speech generated by a Glow-TTS model that uses explicit stochastic pitch prediction, over a Glow-TTS baseline and an improved Glow-TTS model that uses a stochastic duration predictor. 5 pages with 3 figures, InterSpeech 2023
Published: 2023

16. La Chaîne de Compilation Syfala pour le Traitement du Signal Audio sur FPGA

Author: Popoff, Maxime, Michon, Romain, Risset, Tanguy, Cochard, Pierre, Letz, Stephane, Orlarey, Yann, de Dinechin, Florent, Systèmes Embarqués audio programmables (EMERAUDE), CITI Centre of Innovation in Telecommunications and Integration of services (CITI), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)- Centre national de création musicale (GRAME), Centre National de Création Musicale-Centre National de Création Musicale-Inria Lyon, Institut National de Recherche en Informatique et en Automatique (Inria), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria), and Univ Lyon, INSA Lyon, Inria, CITI, Grame, Emeraude
Subjects: HLS, Faust, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO.INFO-ES]Computer Science [cs]/Embedded Systems, Audio DSP, Compilation on FPGA
Abstract: The implementation of real-time audio Digital Signal Processing (DSP) on FPGA has been extensively studied in the past. Up to now, Audio IPs were designed either “by hand” in VHDL or using predefined IPs in block synthesis environments. The advent of High Level Synthesis (HLS) allows for a real compilation flow from high-level audio DSP specifications down to FPGA bit-streams. This paper presents the principles and the implementation of the first “audio DSP compiler” targeting FPGAs. Our fully open-source system compiles audio DSP programs down to FPGA hardware and up to actual sound production. Many parameters such as the number of output channels, sampling rate, etc. are adjusted automatically by the compiler. Software interfaces can be generated to control the system in real-time. This compilation flow presents two important technological breakthroughs for audio programmers: achieving ultra-low latency real-time audio DSP (a few microseconds) and the possibility of easily deploying systems with a large number of audio channels.
Published: 2023

17. Expression-preserving face frontalization improves visually assisted speech processing

Author: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Apprentissage de modèles à partir de données massives (Thoth), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Jean Kuntzmann (LJK), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA), ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019), and European Project: 871245,H2020-EU.2.1.1. - INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT),SPRING(2020)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Vision and Pattern Recognition (cs.CV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer Science - Computer Vision and Pattern Recognition, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], audio-visual speech enhancement, Computer Science - Sound, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Bayesian filtering, robust point registration, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), Artificial Intelligence, Student's t-distribution, variational auto-encoders, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Computer Vision and Pattern Recognition, lip reading, Software, face frontalization, ComputingMethodologies_COMPUTERGRAPHICS, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a frontalization methodology that preserves non-rigid facial deformations in order to boost the performance of visually assisted speech communication. The method alternates between the estimation of (i) the rigid transformation (scale, rotation, and translation) and (ii) the non-rigid deformation between an arbitrarily-viewed face and a face model. The method has two important merits: it can deal with non-Gaussian errors in the data and it incorporates a dynamical face deformation model. For that purpose, we use the generalized Student t-distribution in combination with a linear dynamic system in order to account for both rigid head motions and time-varying facial deformations caused by speech production. We propose to use the zero-mean normalized cross-correlation (ZNCC) score to evaluate the ability of the method to preserve facial expressions. The method is thoroughly evaluated and compared with several state-of-the-art methods, either based on traditional geometric models or on deep learning. Moreover, we show that the method, when incorporated into deep learning pipelines, namely lip reading and speech enhancement, improves word recognition and speech intelligibility scores by a considerable margin. Supplemental material is accessible at https://team.inria.fr/robotlearn/research/facefrontalization/ (arXiv admin note: text overlap with arXiv:2202.00538)
Published: 2023

18. A vector quantized masked autoencoder for speech emotion recognition

Author: Sadok, Samir, Leglaive, Simon, Séguier, Renaud, CentraleSupélec [campus de Rennes], Institut d'Électronique et des Technologies du numéRique (IETR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, and Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)
Subjects: Self-supervised learning, FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, masked autoencoder, Computer Science - Sound, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Machine Learning (cs.LG), vector-quantized variational autoencoder, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], speech emotion recognition, Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER. https://samsad35.github.io/VQ-MAE-Speech/
Published: 2023

19. Spectrogram Inversion for Audio Source Separation via Consistency, Mixing, and Magnitude Constraints

Author: Magron, Paul, Virtanen, Tuomas, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), and University of Tampere [Finland]
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), phase recovery, Audio source separation, alternating projections, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, speech enhancement, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Computer Science - Sound, spectrogram inversion, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms in order to promote estimates that comply with a consistency property, a mixing constraint, and/or a target magnitude objective. Nonetheless, it is still unclear which set of constraints and problem formulation is the most appropriate in practice. In this paper, we design a general framework for deriving spectrogram inversion algorithms, which is based on formulating optimization problems by combining these objectives either as soft penalties or hard constraints. We solve these by means of algorithms that perform alternating projections on the subsets corresponding to each objective/constraint. Our framework encompasses existing techniques from the literature as well as novel algorithms. We investigate the potential of these approaches for a speech enhancement task. In particular, one of our novel algorithms outperforms other approaches in a realistic setting where the magnitudes are estimated beforehand using a neural network.
Published: 2023

20. Learning and controlling the source-filter representation of speech with a variational autoencoder

Author: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier, Institut d'Électronique et des Technologies du numéRique (IETR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ), CentraleSupélec [campus de Rennes], GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA), Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA), ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019), ANR-19-CE33-0008,ML3RI,Apprentissage de bas-niveau d'ineractions robotiques multi-modales avec plusieurs personnes(2019), European Project: 871245,H2020-EU.2.1.1. - INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT),SPRING(2020), Société Française d'Acoustique (SFA), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, and Leglaive, Simon
Subjects: [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Linguistics and Language, Deep generative models, Variational autoencoder, Computer Science - Sound, Language and Linguistics, Representation learning, Machine Learning (cs.LG), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Communication, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Computer Science Applications, Source-filter model, Modeling and Simulation, Computer Vision and Pattern Recognition, Software, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, we show that these subspaces are orthogonal, and based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$. 23 pages, 7 figures, companion website: https://samsad35.github.io/site-sfvae/
Published: 2023

21. From HEAR to GEAR: Generative Evaluation of Audio Representations

Author: Lostanlen, Vincent, Yan, Lingyao, Yang, Xianyi, Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), Signal, IMage et Son (LS2N - équipe SIMS ), Nantes Université (Nantes Univ)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), and Sino-French Engineer School, Beihang University
Subjects: [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; The "Holistic Evaluation of Audio Representations" (HEAR) is an emerging research program towards statistical models that can transfer to diverse machine listening tasks. The originality of HEAR is to conduct a fair, "apples-to-apples" comparison of many deep learning models over many datasets, resulting in multitask evaluation metrics that are readily interpretable by practitioners. On the flip side, this comparison incurs a neural architecture search: as such, it is not directly interpretable in terms of audio signal processing. In this paper, we propose a complementary viewpoint on the HEAR benchmark, which we name GEAR: Generative Evaluation of Audio Representations. The key idea behind GEAR is to generate a dataset of sounds with few independent factors of variability, analyze it with HEAR embeddings, and visualize it with an unsupervised manifold learning algorithm. Visual inspection reveals stark contrasts in the global structure of the nearest-neighbor graphs associated to logmelspec, Open-L3, BYOL, CREPE, wav2vec2, GURA, and YAMNet. Although GEAR currently lacks mathematical refinement, we intend it as a proof of concept to show the potential of parametric audio synthesis in general-purpose machine listening research.
Published: 2023

22. Can we use Common Voice to train a Multi-Speaker TTS system?

Author: Ogun, Sewade, Colotte, Vincent, Vincent, Emmanuel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), and Grid'5000
Subjects: FOS: Computer and information sciences, crowdsourced corpus, Sound (cs.SD), Multi-speaker text-to-speech, Audio and Speech Processing (eess.AS), non-intrusive quality estimation, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Common Voice, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WV-MOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS point with respect to training on all the samples and by 0.35 MOS point with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages. To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar
Published: 2023

23. The Temporal Voice Areas are not 'just' Speech Areas

Author: Régis Trapeau, Etienne Thoret, Pascal Belin, Institut de Neurosciences de la Timone (INT), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique et Systèmes (LIS), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS), Perception, Représentations, Image, Son, Musique (PRISM), Institute of Language, Communication and the Brain (ILCB), Université de Montréal (UdeM), Fondation pour la Recherche Médicale (AJE201214), Excellence Initiative of Aix-Marseille University (A*MIDEX), ANR-16-CE37-0011,PRIMAVOICE,Comparative Studies of Cerebral Voice Processing in Primates(2016), ANR-16-CONV-0002,ILCB,ILCB: Institute of Language Communication and the Brain(2016), ANR-11-LABX-0036,BLRI,Brain & LANGUAGE Research Institute(2011), and European Project: 788240,COVOPRIM
Subjects: [SDV.NEU.PC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior, General Neuroscience, [SCCO.NEUR]Cognitive science/Neuroscience, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SHS.PSY]Humanities and Social Sciences/Psychology
Abstract: International audience; The Temporal Voice Areas (TVAs) respond more strongly to speech sounds than to non-speech vocal sounds, but does this make them Temporal “Speech” Areas? We provide a perspective on this issue by combining univariate, multivariate, and representational similarity analyses of fMRI activations to a balanced set of speech and non-speech vocal sounds. We find that while speech sounds activate the TVAs more than non-speech vocal sounds, which is likely related to their larger temporal modulations in syllabic rate, they do not appear to activate additional areas nor are they segregated from the non-speech vocal sounds when their higher activation is controlled. It seems safe, then, to continue calling these regions the Temporal Voice Areas.
Published: 2023

24. Hearing elliptic movements reveals the imprint of action on prototypical geometries

Author: Etienne Thoret, Mitsuko Aramaki, Lionel Bringoux, Sølvi Ystad, Richard Kronland-Martinet, Perception, Représentations, Image, Son, Musique (PRISM), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), Institute of Language, Communication and the Brain (ILCB), Institut des Sciences du Mouvement Etienne Jules Marey (ISM), ANR-10-CORD-0003,MetaSon,Métaphores sonores(2010), ANR-14-CE24-0018,SoniMove,Informer, Guider et Apprendre l'Action par le Son(2014), ANR-16-CONV-0002,ILCB,ILCB: Institute of Language Communication and the Brain(2016), ANR-11-LABX-0036,BLRI,Brain & LANGUAGE Research Institute(2011), and ANR-11-IDEX-0001,Amidex,INITIATIVE D'EXCELLENCE AIX MARSEILLE UNIVERSITE(2011)
Subjects: Linguistics and Language, [SCCO]Cognitive science, [SDV.NEU.PC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior, Cognitive Neuroscience, [SCCO.PSYC]Cognitive science/Psychology, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Developmental and Educational Psychology, [SHS.PSY]Humanities and Social Sciences/Psychology, Experimental and Cognitive Psychology, Language and Linguistics, [PHYS.MECA.ACOU]Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph]
Abstract: Within certain categories of geometric shapes, prototypical exemplars that best characterize the category have been evidenced. These geometric prototypes are classically identified through the visual and haptic perception or motor production and are usually characterized by their spatial dimension. However, whether prototypes can be recalled through the auditory channel has not been formally investigated. Here we address this question by using auditory cues issued from timbre-modulated friction sounds evoking human drawing elliptic movements. Since non-spatial auditory cues were previously found useful for discriminating distinct geometric shapes such as circles or ellipses, it is hypothesized that sound dynamics alone can evoke shapes such as an exemplary ellipse. Four experiments were conducted and altogether revealed that a common elliptic prototype emerges from auditory, visual, and motor modalities. This finding supports the hypothesis of a common coding of geometric shapes according to biological rules with a prominent role of sensory-motor contingencies in the emergence of such prototypical geometry.
Published: 2023

25. The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

Author: Leglaive, Simon, Borne, Léonie, Tzinis, Efthymios, Sadeghi, Mostafa, Fraticelli, Matthieu, Wisdom, Scott, Pariente, Manuel, Pressnitzer, Daniel, Hershey, John R., Institut d'Électronique et des Technologies du numéRique (IETR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ), CentraleSupélec, Pulse Audition, University of Illinois at Urbana-Champaign [Urbana], University of Illinois System, Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL), Google Inc, and Research at Google
Subjects: CHiME challenge, FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, speech enhancement, unsupervised domain adaptation, multi-speaker conversational speech, Computer Science - Sound, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Electrical Engineering and Systems Science - Audio and Speech Processing, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. In this paper, we introduce the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to leverage real-world noisy speech recordings from the target test domain for unsupervised domain adaptation of speech enhancement models. The target test domain corresponds to the multi-speaker reverberant conversational speech recordings of the CHiME-5 dataset, for which the ground-truth clean speech reference is not available. Given a CHiME-5 recording, the task is to estimate the clean, potentially multi-speaker, reverberant speech, removing the additive background noise. We discuss the motivation for the CHiME-7 UDASE task and describe the data, the task, and the baseline system.
Published: 2023

26. Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis

Author: Vahidi, Cyrus, Han, Han, Wang, Changhong, Lagrange, Mathieu, Fazekas, György, Lostanlen, Vincent, Queen Mary University of London (QMUL), Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), and Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in recent applications of deep learning to the analysis and synthesis of musical audio. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time-frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time-frequency model of similarity at the level of both local spectra and spectrotemporal modulations.
Published: 2023

27. Switching Machine Improvisation Models by Latent Transfer Entropy Criteria

Author: Shlomo Dubnov, Vignesh Gokul, Gerard Assayag, University of California [San Diego] (UC San Diego), University of California (UC), Représentations musicales (Repmus), Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), ERC, ANR-19-CE33-0010,MERCI,Réalité Musicale Mixte avec Instruments Créatifs(2019), and European Project: 883313,REACH
Subjects: granger causality, generative models, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], transfer entropy, musical information dynamics, cocreativity
Abstract: International audience; Music improvisation is the ability of musical generative systems to interact with either another music agent or a human improviser. This is a challenging task, as it is not trivial to define a quantitative measure that evaluates the creativity of the musical agent. It is also not feasible to create huge paired corpora of agents interacting with each other to train a critic system. In this paper we consider the problem of controlling machine improvisation by switching between several pre-trained models by finding the best match to an external control signal. We introduce a measure SymTE that searches for the best transfer entropy between representations of the generated and control signals over multiple generative models.
Published: 2023

28. DDSP-Piano: a Neural Sound Synthesizer Informed by Instrument Knowledge

Author: Renault, Lenny, Mignot, Rémi, Roebel, Axel, Analyse et synthèse sonores [Paris], Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), and European Project: H2020-951911,H2020-EU.2.1.1. - INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT),AI4Media(2020)
Subjects: Deep Learning, Musical Instruments, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO]Computer Science [cs], Sound Synthesis, Piano modeling, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Differentiable Digital Signal Processing, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; Instrument sound synthesis using deep neural networks has received numerous improvements over the last couple of years. Among them, the Differentiable Digital Signal Processing (DDSP) framework has modernized the spectral modeling paradigm by including signal-based synthesizers and effects into fully differentiable architectures. The present work extends the applications of DDSP to the task of polyphonic sound synthesis, with the proposal of a differentiable piano synthesizer conditioned on MIDI inputs. The model architecture is motivated by high-level acoustic modeling knowledge of the instrument which, along with the sound structure priors inherent to the DDSP components, makes for a lightweight, interpretable and realistic sounding piano model. A subjective listening test has revealed that the proposed approach achieves better sound quality than a state-of-the-art neural-based piano synthesizer, but physical-modeling-based models still hold the best quality. Leveraging its interpretability and modularity, a qualitative analysis of the model behavior was also conducted: it highlights where additional modeling knowledge and optimization procedures could be inserted in order to improve the synthesis quality and the manipulation of sound properties. Ultimately, the proposed differentiable synthesizer can be further used with other deep learning models for alternative musical tasks handling polyphonic audio and symbolic data.
Published: 2023

29. The Games We Play: Exploring The Impact of ISMIR on Musicology

Author: Borsan, Vanessa Nina, Giraud, Mathieu, Groult, Richard, Algomus, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV)-Université de Picardie Jules Verne (UPJV)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
Subjects: [SHS.HISPHILSO]Humanities and Social Sciences/History, Philosophy and Sociology of Sciences, [SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]
Abstract: International audience; Throughout history, there has consistently existed a period of time and spatial separation between the creation of new knowledge and technology and their adaptation for widespread implementation. The article delves into how musicology and computational music research interact and exchange their approaches. Specifically, it focuses on a study of ten years' worth of papers from the International Society for Music Information Retrieval (ISMIR) from 2012 to 2021. Over 1000 citations of ISMIR papers were reviewed, and out of these, 51 later works published in musicological venues drew from the findings of 28 ISMIR papers. Final results reveal that most contributions from ISMIR rarely make their way to musicology or humanities. In spite of this, the paper highlights four examples of successful knowledge transfers between the fields and discusses best practices for collaborations while addressing potential causes for such disparities. In the epilogue, we address the interlaced origins of the problem as stemming from new media or language, institutional restrictions, and the inability to engage in multidisciplinary communication.
Published: 2023

30. Associations between music training and the dynamics of writing music by hand

Author: Aurélien Bertiaux, Florence Levé, François Gabrielli, Mathieu Giraud, Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Algomus, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV)-Université de Picardie Jules Verne (UPJV)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Neuro-Dol (Neuro-Dol), Institut National de la Santé et de la Recherche Médicale (INSERM)-Université Clermont Auvergne [2017-2020] (UCA [2017-2020]), Université de Picardie Jules Verne (UPJV), and Giraud, Mathieu
Subjects: Musical notation, Optical music recognition, [SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts, 05 social sciences, [SDV.NEU.SC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Cognitive Sciences, 050109 social psychology, Experimental and Cognitive Psychology, Musical, 050105 experimental psychology, Linguistics, Classical music, Sequence (music), Handwriting, Dynamics (music), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Natural (music), 0501 psychology and cognitive sciences, Psychology, Music
Abstract: International audience; Learning to write music in the staff notation used in Western classical music is part of the musician’s training. However, writing music by hand is rarely taught formally, and many musicians are not aware of the characteristics of their musical handwriting. As with any symbolic expression, musical handwriting is related to the underlying cognition of the musical structures being depicted. Trained musicians read, think, and play music with high-level structures in mind. It seems natural that they would also write music by hand with these structures in mind. Moreover, improving our understanding of handwriting may help to improve both optical music recognition (OMR) and music notation and composition interfaces. We investigated associations between music training and experience and the way people write music by hand. We made video-recordings of participants’ hands while they were copying or freely writing music and analysed the sequence in which they wrote the elements contained in the musical score. The results confirmed that experienced musicians wrote faster than beginners, that they were more likely to write chords from bottom to top, and that they tended to write the note-heads first, in a flowing fashion, and only afterwards use stems and beams to emphasize grouping, and add expressive markings.
Published: 2023

31. Transfer Learning and Bias Correction with Pre-trained Audio Embeddings

Author: Wang, Changhong, Richard, Gaël, McFee, Brian, Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Télécom Paris, New York University [New York] (NYU), NYU System (NYU), and European Project: HI-Audio
Subjects: Domain adaptation, FOS: Computer and information sciences, Sound (cs.SD), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Pre-trained audio embeddings, Bias correction, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Transfer learning
Abstract: Deep neural network models have become the dominant approach to a large variety of tasks within music information retrieval (MIR). These models generally require large amounts of (annotated) training data to achieve high accuracy. Because not all applications in MIR have sufficient quantities of training data, it is becoming increasingly common to transfer models across domains. This approach allows representations derived for one task to be applied to another, and can result in high accuracy with less stringent training data requirements for the downstream task. However, the properties of pre-trained audio embeddings are not fully understood. Specifically, and unlike traditionally engineered features, the representations extracted from pre-trained deep networks may embed and propagate biases from the model's training regime. This work investigates the phenomenon of bias propagation in the context of pre-trained audio representations for the task of instrument recognition. We first demonstrate that three different pre-trained representations (VGGish, OpenL3, and YAMNet) exhibit comparable performance when constrained to a single dataset, but differ in their ability to generalize across datasets (OpenMIC and IRMAS). We then investigate dataset identity and genre distribution as potential sources of bias. Finally, we propose and evaluate post-processing countermeasures to mitigate the effects of bias, and improve generalization across datasets. Comment: 7 pages, 3 figures, accepted to the conference of the International Society for Music Information Retrieval (ISMIR 2023)
Published: 2023

32. Adding context to content improves pattern matching: A study on Slovenian folksongs

Author: Borsan, Vanessa Nina, Giraud, Mathieu, Groult, Richard, Lecroq, Thierry, Algomus, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV)-Université de Picardie Jules Verne (UPJV)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
Subjects: [SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SHS.ANTHRO-SE]Humanities and Social Sciences/Social Anthropology and ethnology
Abstract: Pattern matching is a widespread topic in MIR and related fields. It commonly aims to provide insights into repetitive instances in or across different kinds of music or cultures. The paper presents a study on the analysis of folksongs using symbolic music representation, including both music content and its contextual information. We release a corpus of 400 monophonic Slovenian tunes with structure, contour, and implied harmony annotations. We show that certain descriptors, such as contour types and harmonic "stability", depend on the position of the phrase in the tune. We propose a time- and space-efficient algorithm based on suffix arrays and bit-vectors to match both music content (melodic sequence) and music context (descriptors). We show that pattern-matching queries combining melody and descriptors are more precise for classification tasks. We emphasize the importance of collaborative dynamics between content and context, and stress that not all research questions require the same amount of detail. Consequently, our approach encourages computational music analysis to become more flexible. Lastly, the study aims to promote knowledge of Slovenian folksong.
Published: 2023

33. ENJEUX POÉTIQUES ET POLITIQUES DU PROJET ACOUSTIC COMMONS

Author: Parizot, Cedric, Institut de Recherches et d'Etudes sur les Mondes Arabes et Musulmans (IREMAM), Sciences Po Aix - Institut d'études politiques d'Aix-en-Provence (IEP Aix-en-Provence)-Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), Small Cooperation Project supported in part by the Creative Europe Programme of the European Union, IREMAM, and European Project
Subjects: arts sonores, corporéité, intercorporéité, commons, borders, intercoporality, [SHS.ANTHRO-SE]Humanities and Social Sciences/Social Anthropology and ethnology, son, art-science, frontières, corporality, interbeing, limites, espace, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], sound studies, sound art, spaces, boundaries, communs
Abstract: This text reflects on the ethnographic experience I had during the closing event of the European project Acoustic Commons (Creative Europe), which took place in Aix-en-Provence between October 4th and 9th 2023. It reflects on how the practices of listening and artistic creation around streaming can offer a particularly fine-grained perspective on the relationships of correspondence and emergence we have with humans and non-humans at the turn of the 20th-21st century in the south of France and how we share our commons.
Published: 2023

34. MCMA: A Symbolic Multitrack Contrapuntal Music Archive

Author: Anna Aljanaki, Stefano Kalonaris, Gianluca Micchi, Eric Nichols, University of Tartu, RIKEN Center for Advanced Intelligence Project [Tokyo] (RIKEN AIP), RIKEN - Institute of Physical and Chemical Research [Japon] (RIKEN), Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Algomus, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV)-Université de Picardie Jules Verne (UPJV)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Université de Lille-Centrale Lille-Centre National de la Recherche Scientifique (CNRS), Université de Lille-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Université de Lille-Centrale Lille-Centre National de la Recherche Scientifique (CNRS), and Université de Picardie Jules Verne (UPJV)
Subjects: polyphony, computational musicology, [SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], counterpoint, M1-5000, ComputingMilieux_MISCELLANEOUS, neural machine translation, symbolic music, Music
Abstract: International audience; We present Multitrack Contrapuntal Music Archive (MCMA, available at https://mcma.readthedocs.io), a symbolic dataset of pieces specifically curated to comprise, for any given polyphonic work, independent voices. So far, MCMA consists only of pieces from the Baroque repertoire but we aim to extend it to other contrapuntal music. MCMA is FAIR-compliant and it is geared towards musicological tasks such as (computational) analysis or education, as it brings to the fore contrapuntal interactions by explicit and independent representation. Furthermore, it allows for a more apt use of recent advances in the field of natural language processing (e.g., neural machine translation). For example, MCMA can be particularly useful in the context of language-based machine learning models for music generation. Despite its current modest size, we believe MCMA to be an important addition to online contrapuntal music databases, and we thus open it to contributions from the wider community, in the hope that MCMA can continue to grow beyond our efforts. In this article, we provide the rationale for this corpus, suggest possible use cases, offer an overview of the compiling process (data sourcing and processing), and present a brief statistical analysis of the corpus at the time of writing. Finally, future work that we endeavor to undertake is discussed.
Published: 2021

35. Détection robuste d'événements sonores

Author: Olvera, Michel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Université de Lorraine, Emmanuel Vincent, and ANR-18-CE23-0020,LEAUDS,Apprentissage statistique pour la compréhension de scènes audio(2018)
Subjects: Domain adaptation, Apprentissage profond, Audio source separation, Adaptation de domaine, Deep learning, Séparation de sources audio, Sound event detection, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Classification de scènes sonores, Détection d'événements sonores, Acoustic scene classification, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO]Computer Science [cs]
Abstract: From industry to general interest applications, computational analysis of sound scenes and events allows us to interpret the continuous flow of everyday sounds. One of the main degradations encountered when moving from lab conditions to the real world is due to the fact that sound scenes are not composed of isolated events but of multiple simultaneous events. Differences between training and test conditions also often arise due to extrinsic factors such as the choice of recording hardware and microphone positions, as well as intrinsic factors of sound events, such as their frequency of occurrence, duration and variability. In this thesis, we investigate problems of practical interest for audio analysis tasks to achieve robustness in real scenarios. Firstly, we explore the separation of ambient sounds in a practical scenario in which multiple short duration sound events with fast varying spectral characteristics (i.e., foreground sounds) occur simultaneously with background stationary sounds. We introduce the foreground-background ambient sound separation task and investigate whether a deep neural network with auxiliary information about the statistics of the background sound can differentiate between rapidly- and slowly-varying spectro-temporal characteristics. Moreover, we explore the use of per-channel energy normalization (PCEN) as a suitable pre-processing and the ability of the separation model to generalize to unseen sound classes. Results on mixtures of isolated sounds from the DESED and Audioset datasets demonstrate the generalization capability of the proposed separation system, which is mainly due to PCEN. Secondly, we investigate how to improve the robustness of audio analysis systems under mismatched training and test conditions. We explore two distinct tasks: acoustic scene classification (ASC) with mismatched recording devices and training of sound event detection (SED) systems with synthetic and real data. 
In the context of ASC, without assuming the availability of recordings captured simultaneously by mismatched training and test recording devices, we assess the impact of moment normalization and matching strategies and their integration with unsupervised adversarial domain adaptation. Our results show the benefits and limitations of these adaptation strategies applied at different stages of the classification pipeline. The best strategy matches source domain performance in the target domain. In the context of SED, we propose a PCEN-based acoustic front-end with learned parameters. Then, we study the joint training of SED with auxiliary classification branches that categorize sounds as foreground or background according to their spectral properties. We also assess the impact of aligning the distributions of synthetic and real data at the frame or segment level based on optimal transport. Finally, we integrate an active learning strategy in the adaptation procedure. Results on the DESED dataset indicate that these methods are beneficial for the SED task and that their combination further improves performance on real sound scenes.; From industry to general-interest applications, the automatic analysis of sound scenes and events makes it possible to interpret the continuous flow of everyday sounds. One of the main degradations encountered when moving from laboratory conditions to the real world stems from the fact that sound scenes are composed not of isolated events but of several simultaneous events. Differences between training and test conditions also often arise from extrinsic factors, such as the choice of recording hardware and microphone positions, and from factors intrinsic to sound events, such as their frequency of occurrence, duration and variability.
In this thesis, we study problems of practical interest for sound analysis tasks in order to achieve robustness in real scenarios. First, we explore the separation of ambient sounds in a practical scenario in which several short sound events with rapidly varying spectral characteristics (i.e., foreground sounds) occur simultaneously with stationary background sounds. We introduce the foreground-background sound separation task and examine whether a deep neural network given auxiliary information about the statistics of the background sound can distinguish rapidly varying from slowly varying spectro-temporal characteristics. In addition, we explore the use of per-channel energy normalization (PCEN) as preprocessing and the ability of the separation model to generalize to sound classes unseen during training. Results on mixtures of isolated sounds from the DESED and Audioset datasets demonstrate the generalization ability of the proposed separation system, which is mainly due to PCEN. Second, we study how to improve the robustness of sound analysis systems under mismatched training and test conditions. We explore two distinct tasks: acoustic scene classification (ASC) with different recording devices, and the training of sound event detection (SED) systems with synthetic and real data. In the context of ASC, without assuming the availability of recordings captured simultaneously by the training and test recording devices, we evaluate the impact of moment normalization and matching strategies and their integration with unsupervised adversarial domain adaptation.
Our results show the benefits and limits of these adaptation strategies applied at different stages of the classification pipeline. The best strategy reaches source-domain performance in the target domain. For SED, we propose a PCEN-based front-end with learned parameters. We then study the joint training of the SED system with auxiliary classification branches that categorize sounds as foreground or background according to their spectral properties. We also evaluate the impact of aligning the distributions of synthetic and real data at the frame or segment level through optimal transport. Finally, we integrate an active learning strategy into the adaptation procedure. Results on the DESED dataset indicate that these methods benefit the SED task and that combining them further improves performance on real sound scenes.
Published: 2022
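
Note: the per-channel energy normalization (PCEN) mentioned in the abstract above can be illustrated with a short sketch. The code below follows the commonly published PCEN formula (a first-order smoother along time, adaptive gain control, then root compression) with fixed parameters; it is not the learned front-end of the thesis. The function name `pcen` and all default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a non-negative (n_frames, n_channels) energy array.

    Hypothetical helper: the defaults are illustrative assumptions,
    not values taken from the record above.
    """
    energy = np.asarray(energy, dtype=float)
    smoothed = np.empty_like(energy)
    # First-order IIR smoother along time: M[t] = (1 - s) * M[t-1] + s * E[t]
    smoothed[0] = energy[0]
    for t in range(1, energy.shape[0]):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energy[t]
    # Adaptive gain control: divide each channel by its smoothed energy
    gain = (eps + smoothed) ** (-alpha)
    # Root compression with offset delta
    return (energy * gain + delta) ** r - delta ** r
```

With `alpha=1.0`, a stationary input is normalized almost independently of its level, which is the property that makes PCEN attractive for suppressing stationary background sounds.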

36. A Model You Can Hear: Audio Identification with Playable Prototypes

Author: Loiseau, Romain, Bouvier, Baptiste, Teytaut, Yann, Vincent, Elliot, Aubry, Mathieu, Loic Landrieu, Laboratoire d'Informatique Gaspard-Monge (LIGM), École des Ponts ParisTech (ENPC)-Centre National de la Recherche Scientifique (CNRS)-Université Gustave Eiffel, Laboratoire sciences et technologies de l'information géographique (LaSTIG), Ecole des Ingénieurs de la Ville de Paris (EIVP)-École nationale des sciences géographiques (ENSG), Institut National de l'Information Géographique et Forestière [IGN] (IGN)-Université Gustave Eiffel-Institut National de l'Information Géographique et Forestière [IGN] (IGN)-Université Gustave Eiffel, Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Models of visual object recognition and scene understanding (WILLOW), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, and Institut National de Recherche en Informatique et en Automatique (Inria)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, ismir, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Computer Science - Sound, ismir2022, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: International audience; Machine learning techniques have proved useful for classifying and analyzing audio content. However, recent methods typically rely on abstract and high-dimensional representations that are difficult to interpret. Inspired by transformation-invariant approaches developed for image and 3D data, we propose an audio identification model based on learnable spectral prototypes. Equipped with dedicated transformation networks, these prototypes can be used to cluster and classify input audio samples from large collections of sounds. Our model can be trained with or without supervision and reaches state-of-the-art results for speaker and instrument identification, while remaining easily interpretable. The code is available at: https://github.com/romainloiseau/a-model-you-can-hear
Published: 2022

37. Towards modeling alternating patterns through inter-notes relations

Author: Vantalon, Perrine, Giraud, Mathieu, Groult, Richard, Lecroq, Thierry, Algomus, Equipe Traitement de l'information en Biologie Santé [TIBS - LITIS], Modélisation, Information et Systèmes - UR UPJV 4290 [MIS], Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV)-Université de Picardie Jules Verne (UPJV)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA), and Université de Picardie Jules Verne (UPJV)
Subjects: [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS], [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]
Abstract: International audience; A monophonic instrument can play several voices or lines at once, for example when interleaving pedal notes and scales. Such patterns may bear melodic, harmonic, and rhythmic elements and are frequent in cello music. We propose a model of alternating patterns, where regularly spaced pitches are linked by some relation. We also propose an algorithm to list all such patterns. Perspectives include better corpus analysis, and extending and benchmarking such algorithms.
Published: 2022
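
Note: the idea of regularly spaced pitches linked by a relation can be made concrete with a small sketch. This is a simplified illustration, not the authors' model or algorithm: it assumes pitches are given as MIDI numbers, that the spacing (`period`) is fixed in advance, and that the relation is any two-argument predicate. The function name `alternating_runs` and its parameters are hypothetical.

```python
def alternating_runs(pitches, period, relation, min_length=3):
    """List maximal runs (start_index, length) of notes spaced `period`
    positions apart whose consecutive members all satisfy `relation`."""
    runs = []
    n = len(pitches)
    for offset in range(min(period, n)):
        start, length, i = offset, 1, offset
        while i + period < n:
            if relation(pitches[i], pitches[i + period]):
                length += 1
            else:
                # The chain breaks here: record it if long enough, then restart
                if length >= min_length:
                    runs.append((start, length))
                start, length = i + period, 1
            i += period
        if length >= min_length:
            runs.append((start, length))
    return sorted(runs)

# A pedal note (D4 = 62) interleaved with an ascending line
notes = [62, 65, 62, 67, 62, 69, 62, 70]
pedal = alternating_runs(notes, 2, lambda a, b: a == b)
rising = alternating_runs(notes, 2, lambda a, b: 0 < b - a <= 2)
```

Here `pedal` picks out the repeated note on even positions and `rising` the stepwise ascending line on odd positions, each as one run of four notes.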

38. Paradigmes d'Apprentissage Automatique Non-Supervisés pour les Représentations de la Similarité et de la Structure Musicale

Author: Marmoret, Axel, Parcimonie et Nouveaux Algorithmes pour le Signal et la Modélisation Audio (PANAMA), Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-SIGNAUX ET IMAGES NUMÉRIQUES, ROBOTIQUE (IRISA-D5), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] 
(IMT)-Institut Mines-Télécom [Paris] (IMT), Contrat doctoral, Université Rennes 1, Frédéric Bimbot, and ANR-20-CE23-0010,LoRAiA,Approximations de rang faible pour l'intelligence artificielle(2020)
Subjects: ACM: H.: Information Systems/H.5: INFORMATION INTERFACES AND PRESENTATION (e.g., HCI)/H.5.5: Sound and Music Computing, Unsupervised Machine Learning and Optimization Methods, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Musique, Apprentissage Automatique Non-Supervisé et Méthodes d'Optimisation, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], Segmentation Structurelle, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Structural Segmentation, Music, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Musical structure, defined as a simplified representation of the organization of a song, is an important musicological concept, but it is hard to estimate automatically. This thesis presents new methods to automatically estimate the structural segmentation of a song, focusing on the study of music at the bar scale. By developing a new segmentation algorithm (called "CBM") and by comparing several unsupervised compression schemes (from linear and multilinear algebra to neural networks), the paradigms introduced in this thesis achieve segmentation performance that exceeds that of unsupervised state-of-the-art methods and comes close to the global state of the art, obtained with supervised machine learning algorithms. In particular, as the methods described in this thesis are unsupervised, the estimation does not rely on annotated data, which lowers the bias in the estimates related to ambiguity and subjectivity (inherent to musical structure) while limiting the loss in performance compared to the best supervised methods. In addition, some of the methods studied in this thesis (in particular Nonnegative Tucker Decomposition) make it possible to automatically extract interpretable parts of a song, which may be used for tasks other than structure estimation and which contribute to the development of interpretable machine and deep learning algorithms, a major field of research nowadays.; Musical structure, defined as the simplified representation of the organization of a piece of music, is an important musicological concept that is nevertheless complex to estimate automatically. This thesis presents new methods for automatically estimating musical structure, focusing on study at the scale of the musical bar.
Through the development of a new segmentation algorithm (called "CBM") and through the study and comparison of different unsupervised compression methods (ranging from linear and multilinear algebra to neural networks), the paradigms introduced in this thesis yield quantitative results that surpass the current unsupervised state of the art and approach the global state of the art, which stems from supervised learning methods. In particular, since the methods described in this thesis are unsupervised, the estimation does not rely on annotated databases, which mitigates the biases linked to ambiguity and subjectivity (inherent to musical structure) while limiting the loss in performance relative to the best supervised methods. Finally, some of the methods studied in this thesis (in particular nonnegative Tucker decomposition) make it possible to automatically extract interpretable parts of the song that could be used for tasks other than structure estimation, and could feed into the development of interpretable deep machine learning algorithms, a major research topic today.
Published: 2022

39. Architectural acoustic design: Observation of use cases including audio-only and multimodal auralizations

Author: Sebastien Jouan, David Poirier-Quinot, Vincent Boccara, David Thery, Brian F. G. Katz, Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Lutheries - Acoustique - Musique (IJLRDA-LAM), Institut Jean Le Rond d'Alembert (DALEMBERT), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), and Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
Subjects: Architectural engineering, Acoustics and Ultrasonics, Test design, Computer science, Ecological validity, Mechanical Engineering, 0211 other engineering and technologies, Context (language use), 02 engineering and technology, Building and Construction, Virtual reality, 01 natural sciences, Variety (cybernetics), Identification (information), Noise, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 021105 building & construction, 0103 physical sciences, Use case, 010301 acoustics
Abstract: International audience; Auralization technology has reached a satisfactory level of ecological validity, enabling its use in architectural acoustic design. Only recently have the actual uses of auralization in the consulting community been explored, resulting in the identification of a variety of uses, including (1) to present to clients, (2) to test design ideas, (3) as a verification tool, (4) as a marketing tool, and (5) to improve internal company discussions. Taking advantage of methodologies from ergonomics research, the present study investigates effective uses through the observation of a collaboration project between an acoustic research team and an acoustic consultant, as a case study. Two spaces were auralized during the design phase of a new skyscraper. The two spaces posed different problems: an atrium, for which three acoustic treatment options were proposed and experienced through multimodal auralizations, and an auditorium, presented through audio-only auralizations, where an intrusive noise was to be acoustically treated. The ergonomic observation and analysis of this project revealed key impediments to the integration of auralization in common acoustic design practices.
Published: 2021

40. Gridless 3D Recovery of Image Sources from Room Impulse Responses

Author: Tom Sprunck, Antoine Deleforge, Yannick Privat, Cedric Foy, Institut de Recherche Mathématique Avancée (IRMA), Université de Strasbourg (UNISTRA)-Centre National de la Recherche Scientifique (CNRS), TOkamaks and NUmerical Simulations (TONUS), Université de Strasbourg (UNISTRA)-Centre National de la Recherche Scientifique (CNRS)-Université de Strasbourg (UNISTRA)-Centre National de la Recherche Scientifique (CNRS)-Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Unité Mixte de Recherche en Acoustique Environnementale (UMRAE ), Centre d'Etudes et d'Expertise sur les Risques, l'Environnement, la Mobilité et l'Aménagement (Cerema)-Université Gustave Eiffel, and ANR-20-CE48-0013,DENISE,Attaque de problèmes difficiles en audio par des approches inverses non-linéaires économes en données(2020)
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Sound (cs.SD), Acoustic reflectors, convex optimization, room shape, Applied Mathematics, Classical Physics (physics.class-ph), FOS: Physical sciences, super-resolution, Physics - Classical Physics, sound field, Computer Science - Sound, [PHYS.MECA.ACOU]Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), Signal Processing, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Electrical and Electronic Engineering, Electrical Engineering and Systems Science - Audio and Speech Processing, sliding Frank-Wolfe
Abstract: Given a sound field generated by a sparse distribution of impulse image sources, can the continuous 3D positions and amplitudes of these sources be recovered from discrete, bandlimited measurements of the field at a finite set of locations, e.g., a multichannel room impulse response? Borrowing from recent advances in super-resolution imaging, it is shown that this nonlinear, non-convex inverse problem can be efficiently relaxed into a convex linear inverse problem over the space of Radon measures in R^3. The linear operator introduced here stems from the fundamental solution of the free-field inhomogeneous wave equation combined with the receivers' responses. An adaptation of the Sliding Frank-Wolfe algorithm is proposed to numerically solve the problem off-the-grid, i.e., in continuous 3D space. Simulated experiments show that the approach achieves near-exact recovery of hundreds of image sources using an arbitrarily placed compact 32-channel spherical microphone array in random rectangular rooms. The impact of noise, sampling rate and array diameter on these results is also examined. Comment: IEEE Signal Processing Letters, 2022
Published: 2022

41. How to (Virtually) Train Your Sound Source Localizer

Author: Srivastava, Prerak, Deleforge, Antoine, Politis, Archontis, Vincent, Emmanuel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Institut National de Recherche en Informatique et en Automatique (Inria), Parcimonie et Nouveaux Algorithmes pour le Signal et la Modélisation Audio (PANAMA), Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-SIGNAUX ET IMAGES NUMÉRIQUES, ROBOTIQUE (IRISA-D5), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et 
en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), and Analysis, perception and recognition of speech (PAROLE)
Subjects: image source, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], room acoustic simulation, directivity, localization, direction-of-arrival, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: Learning-based methods have become ubiquitous in sound source localization (SSL). Existing systems rely on simulated training sets for lack of sufficiently large, diverse and annotated real datasets. Most room acoustic simulators used for this purpose rely on the image source method (ISM) because of its computational efficiency. This paper argues that carefully extending the ISM to incorporate more realistic surface, source and microphone responses into training sets can significantly boost the real-world performance of SSL systems. It is shown that increasing the training-set realism of a state-of-the-art direction-of-arrival estimator yields consistent improvements across three different real test sets featuring human speakers in a variety of rooms and various microphone arrays. An ablation study further reveals that every added layer of realism contributes positively to these improvements.
Published: 2022
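
Note: the image source method (ISM) named in the abstract above can be illustrated in its most basic form. The sketch below assumes an ideal omnidirectional point source in a rectangular ("shoebox") room with one corner at the origin, and computes only first-order reflections with no wall absorption; the realistic surface, source and microphone responses that the paper adds are deliberately left out. The function names, the room layout convention and the speed-of-sound constant are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C

def first_order_images(source, room):
    """Mirror `source` (x, y, z) across the six walls of a shoebox room
    with one corner at the origin and dimensions `room` (Lx, Ly, Lz)."""
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            # Reflection of the source across the wall plane
            image[axis] = 2.0 * wall - source[axis]
            images.append(tuple(image))
    return images

def arrival_delays(source, mic, room):
    """Delays in seconds: the direct path first, then the six
    first-order reflections."""
    paths = [tuple(source)] + first_order_images(source, room)
    return [math.dist(p, mic) / SPEED_OF_SOUND for p in paths]

# Hypothetical geometry, in meters
room = (4.0, 5.0, 3.0)
source = (1.0, 2.0, 1.5)
mic = (3.0, 1.0, 1.0)
delays = arrival_delays(source, mic, room)
```

Each delay corresponds to one impulse in an idealized room impulse response; higher reflection orders are obtained by mirroring the image sources again.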

42. Sleep deprivation measured by voice analysis

Author: Etienne Thoret, Thomas Andrillon, Caroline Gauriau, Damien Léger, Daniel Pressnitzer, Perception, Représentations, Image, Son, Musique (PRISM), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), Laboratoire des systèmes perceptifs (LSP), Département d'Etudes Cognitives - ENS Paris (DEC), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique et Systèmes (LIS), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS), Institute of Language, Communication and the Brain (ILCB), T000362/2018-L, ANR-16-CONV-0002,ILCB,ILCB: Institute of Language Communication and the Brain(2016), ANR-11-LABX-0036,BLRI,Brain & LANGUAGE Research Institute(2011), ANR-11-IDEX-0001,Amidex,INITIATIVE D'EXCELLENCE AIX MARSEILLE UNIVERSITE(2011), ANR-19-CE28-0019,AMBISENSE,Accès à l'ambiguïté perceptive: mesures comportementales, métacognitives et physiologiques de la perception en situation d'incertitude(2019), and ANR-17-EURE-0017,FrontCog,Frontières en cognition(2017)
Subjects: [SDV.NEU.PC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior, [SCCO.PSYC]Cognitive science/Psychology, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SDV.MHEP.PHY]Life Sciences [q-bio]/Human health and pathology/Tissues and Organs [q-bio.TO], [SHS.PSY]Humanities and Social Sciences/Psychology, [PHYS.MECA.ACOU]Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: We show that sleep deprivation in otherwise normal and healthy adults can be detected through machine-learning analysis of vocal recordings. Importantly, we used fully generic acoustic features, derived from auditory models, together with our own machine learning interpretation method, derived from neuroscience. Sleep deprivation impacted two broad types of acoustic features: one related to speech rhythms, the other related to the timbre of the voice. Such features plausibly reflect two independent physiological processes: one explicit, the cognitive control of speech production, and the other implicit, the inflammation of the vocal apparatus. Crucially, the relative balance of the two processes varied widely across individuals, consistent with the known but unexplained variability in responses to sleep deprivation. Overall, our results suggest that the voice may be used as a “sleep stethoscope” to characterize the individual effects of sleep deprivation. Moreover, the method we applied is fully general and could be adapted to any future investigation of vocal biomarkers using machine-learning techniques. Author summary: Sleep deprivation has an ever-increasing impact on individuals and societies, from accidents to chronic conditions costing billions to health systems. Yet, to date, there is no quick and objective test for sleep deprivation. We show that sleep deprivation can be detected at the individual level with voice recordings, outlining future cost-effective and non-invasive “sleep stethoscopes”. Importantly, we focused on interpretability, which identified two independent physiological effects of sleep deprivation: a change in prosody, related to cognitive control, and a change in timbre, related to inflammation. This also revealed a striking variability in individual reactions to the same deprivation. 
The neuroscientific framework we developed, combining auditory models and machine learning, is freely available and could be adapted to any vocal biomarker.
Published: 2022

43. Performance above all? Energy consumption vs. performance for machine listening, a study on DCASE task 4 baseline

Author: Serizel, Romain, Cornell, Samuele, Turpault, Nicolas, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Università Politecnica delle Marche [Ancona] (UNIVPM), Inria Rennes – Bretagne Atlantique, and Institut National de Recherche en Informatique et en Automatique (Inria)
Subjects: sound event detection, carbon footprint, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, efficiency, energy consumption, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], machine listening, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: In machine listening there is a tendency to resort to models with a growing number of parameters raising thus concerns about the practical viability of these due to their energy consumption. Reporting energy consumption of the models could be a first step to raise awareness on this matter. Yet, estimating the energy consumption across different conditions (hyper-parameters, GPU types etc.) poses some challenges in terms of biases and fairness of the comparison between different models and works. In this paper we perform an extensive study using the DCASE task 4 baseline system and monitor energy consumption and training time for different GPU types and batch sizes. The goal is to identify which aspects can have an impact on the estimation of the energy consumption and should be normalized for a fair comparison across systems. Additionally, we propose an analysis of the relationship between the energy consumption and the sound event detection performance that calls into question our current way to evaluate systems.
Published: 2022

44. Detection and identification of beehive piping audio signals

Author: Fourer, Dominique, Orlowska, Agnieszka, Informatique, BioInformatique, Systèmes Complexes (IBISC), Université d'Évry-Val-d'Essonne (UEVE)-Université Paris-Saclay, and ANR-19-CE48-0001,ASCETE,Analyse et Séparation des signaux Complexes: Exploiter la structure Temps-fréquence(2019)
Subjects: [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], smart beekeeping, tooting, audio signal recognition, bees piping signals, quacking
Abstract: International audience; Piping signals are particular sounds emitted by honey bees during the swarming season or sometimes when bees are exposed to specific factors during the life of the colony. Such sounds are of interest for beekeepers for predicting an imminent swarming of a beehive. The present study introduces a novel publicly available dataset made of several honey bee piping recordings allowing for the evaluation of future audio-based detection and recognition methods. First, we propose an analysis of the most relevant timbre features for discriminating between tooting and quacking sounds which are two distinct types of piping signals. Second, we comparatively assess several machine-learning-based methods designed for the detection and the identification of piping signals through a beehive-independent 3-fold cross-validation methodology.
Published: 2022

45. Fast and efficient speech enhancement with variational autoencoders

Author: Sadeghi, Mostafa, Serizel, Romain, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Speech enhancement, Langevin dynamics, Computer Science - Sound, Machine Learning (cs.LG), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, variational autoencoder, generative model, Electrical Engineering and Systems Science - Signal Processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Unsupervised speech enhancement based on variational autoencoders has shown promising performance compared with the commonly used supervised methods. This approach involves the use of a pre-trained deep speech prior along with a parametric noise model, where the noise parameters are learned from the noisy speech signal with an expectation-maximization (EM)-based method. The E-step involves an intractable latent posterior distribution. Existing algorithms to solve this step are either based on computationally heavy Monte Carlo Markov Chain sampling methods and variational inference, or inefficient optimization-based methods. In this paper, we propose a new approach based on Langevin dynamics that generates multiple sequences of samples and comes with a total variation-based regularization to incorporate temporal correlations of latent vectors. Our experiments demonstrate that the developed framework makes an effective compromise between computational efficiency and enhancement quality, and outperforms existing methods.
Published: 2022

46. Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)

Author: Lagrange, Mathieu, Mesaros, Annamaria, Pellegrini, Thomas, Richard, Gael, Serizel, Romain, Stowell, Dan, Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), University of Tampere [Finland], Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse Capitole (UT Capitole), Université de Toulouse (UT)-Université de Toulouse (UT)-Université Toulouse - Jean Jaurès (UT2J), Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3), Université de Toulouse (UT)-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP), Université de Toulouse (UT)-Toulouse Mind & Brain Institut (TMBI), Université Toulouse - Jean Jaurès (UT2J), Université de Toulouse (UT)-Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3), Université de Toulouse (UT)-Université Toulouse Capitole (UT Capitole), Université de Toulouse (UT), Université Toulouse III - Paul Sabatier (UT3), Télécom Paris, Département Images, Données, Signal (IDS), Télécom ParisTech, Signal, Statistique et Apprentissage (S2A), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Tilburg University [Tilburg], Netspar, and Naturalis Biodiversity Center [Leiden]
Subjects: [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Published: 2022

47. Weighted variance variational autoencoder for speech enhancement

Author: Golmakani, Ali, Sadeghi, Mostafa, Alameda-Pineda, Xavier, Serizel, Romain, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Inria Grenoble - Rhône-Alpes, and Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Speech enhancement, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Computer Science - Sound, Machine Learning (cs.LG), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), Student's t-distribution, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, variational autoencoder, generative model, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. While this is the commonly used approach, in this paper we propose a weighted variance generative model, where the contribution of each TF point in parameter learning is weighted. We impose a Gamma prior distribution on the weights, which would effectively lead to a Student's t-distribution instead of Gaussian for speech modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram modeling and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
Published: 2022

48. Pipe organ buffet radiation patterns under different excitation strategies

Author: Villegas Curulla, Gonzalo, Canfield-Dafilou, Elliot, Domenighini, Piergiovanni, Fabre, Benoît, d'Alessandro, Christophe, Katz, Brian, Lutheries - Acoustique - Musique (IJLRDA-LAM), Institut Jean Le Rond d'Alembert (DALEMBERT), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Università degli Studi di Perugia = University of Perugia (UNIPG), ANR-20-JPIC-0002,PHÉ,Les Oreilles du Passé(2020), ANR-20-CE38-0014,PHEND,Le passé a des oreilles à Notre Dame(2020), and European Project: JPI-CH 2020,PHE
Subjects: [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Pipe Organ, Source Directivity, Musical Acoustics
Abstract: International audience; Studying the directivity of musical instruments such as pipe organs is challenging because of their great size and because they typically cannot be moved from the locations where they are built to the laboratory. This study compares the radiation pattern of organ buffets under different excitation conditions, using a 19th century French pipe organ. We investigate the positive and the great organ independently due to their spatial separation and size. The excitation strategies were comprised of cylindrical and an omnidirectional electroacoustic sources. Measurements were carried out on each organ buffet along a horizontal line spanning the width of the church. Results in octave bands are shown and compared with a discussion on possible causes for observed differences. Significant variations in directivity were observed for the 2 kHz to 4 kHz octave-band regions where scattering from pipes is expected to have a predominant effect, while little variation from omnidirectional was observed for lower frequency bands for both sources.
Published: 2022

49. Tactile perception of auditory roughness

Author: Corentin Bernard, Richard Kronland-Martinet, Madeline Fery, Sølvi Ystad, Etienne Thoret, Perception, Représentations, Image, Son, Musique (PRISM), Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique et Systèmes (LIS), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS), Institute of Language, Communication and the Brain (ILCB), ANR-16-CONV-0002,ILCB,ILCB: Institute of Language Communication and the Brain(2016), ANR-11-LABX-0036,BLRI,Brain & LANGUAGE Research Institute(2011), and ANR-11-IDEX-0001,Amidex,INITIATIVE D'EXCELLENCE AIX MARSEILLE UNIVERSITE(2011)
Subjects: Pulmonary and Respiratory Medicine, auditory perception, [SDV.NEU.PC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior, psychophysics, Pediatrics, Perinatology and Child Health, [SCCO.PSYC]Cognitive science/Psychology, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [SHS.PSY]Humanities and Social Sciences/Psychology, [SPI.MECA.VIBR]Engineering Sciences [physics]/Mechanics [physics.med-ph]/Vibrations [physics.class-ph], Tactile perception, roughness, [PHYS.MECA.ACOU]Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph]
Abstract: International audience; Auditory roughness resulting from fast temporal beatings is often studied by summing two pure tones with close frequencies. Interestingly, the tactile counterpart of auditory roughness can be provided through touch with vibrotactile actuators. However, whether auditory roughness could also be perceived through touch and whether it exhibits similar characteristics are unclear. Here, auditory roughness perception and its tactile counterpart were evaluated using pairs of pure tone stimuli. Results revealed similar roughness curves in both modalities, suggesting similar sensory processing. This study attests to the relevance of such a paradigm for investigating auditory and tactile roughness in a multisensory fashion.
Published: 2022

50. From the Lab to the Stage: Practical Considerations on Designing Performances with Immersive Virtual Musical Instruments

Author: Zappi, Victor, Berthaut, Florent, Mazzanti, Dario, Northeastern University [Boston], Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), and Fondazione Istituto Italiano di Tecnologia, Genova
Subjects: [SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]
Abstract: Immersive virtual musical instruments (IVMIs) lie at the intersection between music technology and virtual reality. Being both digital musical instruments (DMIs) and elements of virtual environments (VEs), IVMIs have the potential to transport the musician into a world of imagination and unprecedented musical expression. But when the final aim is to perform live on stage, the employment of these technologies is anything but straightforward, for sharing the virtual musical experience with the audience gets quite arduous. In this chapter, we assess in detail the several technical and conceptual challenges linked to the composition of IVMI performances on stage, i.e., their scenography, providing a new critical perspective on IVMI performance and design. We first propose a set of dimensions meant to analyse IVMI scenographies, as well as to evaluate their compatibility with different instrument metaphors and performance rationales. Such dimensions are built from the specifics and constraints of DMIs and VEs; they include the level of immersion of musicians and spectators and provide an insight into the interaction techniques afforded by 3D user interfaces in the context of musical expression. We then analyse a number of existing IVMIs and stage setups, and finally suggest new ones, with the aim to facilitate the design of future immersive performances.
Published: 2022
