1. CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning
- Author
Bernd Edler, Emanuel A. P. Habets, Fabian-Robert Stöter, Soumitro Chakrabarty
- Affiliations
Scientific Data Management (ZENITH), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS), Université de Montpellier (UM), Inria Sophia Antipolis - Méditerranée (CRISAM), Institut National de Recherche en Informatique et en Automatique (Inria)
International Audio Laboratories Erlangen (AUDIO LABS), Friedrich-Alexander Universität Erlangen-Nürnberg (FAU), Fraunhofer Institute for Integrated Circuits (Fraunhofer IIS), Fraunhofer (Fraunhofer-Gesellschaft)
- Acknowledgements
The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE). They would like to thank A. Liutkus for his constructive criticism of the paper.
- Subjects
Speaker count estimation, Reverberation, cocktail-party, overlap detection, Acoustics and Ultrasonics, Artificial neural network, Computer science, Speech recognition, Supervised learning, Probabilistic logic, Blind signal separation, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, [MATH.MATH-LO]Mathematics [math]/Logic [math.LO], Recurrent neural network, Computer Science (miscellaneous), number of concurrent speakers, Point estimation, Electrical and Electronic Engineering, 0305 other medical science
- Abstract
Estimating the maximum number of concurrent speakers from single-channel mixtures is a challenging problem and an essential first step to address various audio-based tasks such as blind source separation, speaker diarization, and audio surveillance. We propose a unifying probabilistic paradigm, where deep neural network architectures are used to infer output posterior distributions. These probabilities are in turn processed to yield discrete point estimates. Designing such architectures often involves two important and complementary aspects that we investigate and discuss. First, we study how recent advances in deep architectures may be exploited for the task of speaker count estimation. In particular, we show that convolutional recurrent neural networks outperform recurrent networks used in a previous study when adequate input features are used. Even for short segments of speech mixtures, we can estimate up to five speakers, with a significantly lower error than other methods. Second, through comprehensive evaluation, we compare the best-performing method to several baselines and examine the influence of gain variations, different data sets, and reverberation. The output of our proposed method is compared to human performance. Finally, we give insights into the strategy used by our proposed method.
- Published
2019