CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning

Authors :
Bernd Edler
Emanuel A. P. Habets
Fabian-Robert Stöter
Soumitro Chakrabarty
Scientific Data Management (ZENITH)
Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM)
Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Inria Sophia Antipolis - Méditerranée (CRISAM)
Institut National de Recherche en Informatique et en Automatique (Inria)
International Audio Laboratories Erlangen (AUDIO LABS)
Friedrich-Alexander Universität Erlangen-Nürnberg (FAU)-Fraunhofer Institute for Integrated Circuits (Fraunhofer IIS)
Fraunhofer (Fraunhofer-Gesellschaft)
The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE). They would like to thank A. Liutkus for his constructive criticism of the paper.
Source :
IEEE/ACM Transactions on Audio, Speech, and Language Processing, Institute of Electrical and Electronics Engineers, 2019, 27 (2), pp. 268-282. ⟨10.1109/TASLP.2018.2877892⟩
Publication Year :
2019
Publisher :
HAL CCSD, 2019.

Abstract

Estimating the maximum number of concurrent speakers from single-channel mixtures is a challenging problem and an essential first step to address various audio-based tasks such as blind source separation, speaker diarization, and audio surveillance. We propose a unifying probabilistic paradigm, where deep neural network architectures are used to infer output posterior distributions. These probabilities are in turn processed to yield discrete point estimates. Designing such architectures often involves two important and complementary aspects that we investigate and discuss. First, we study how recent advances in deep architectures may be exploited for the task of speaker count estimation. In particular, we show that convolutional recurrent neural networks outperform recurrent networks used in a previous study when adequate input features are used. Even for short segments of speech mixtures, we can estimate up to five speakers, with a significantly lower error than other methods. Second, through comprehensive evaluation, we compare the best-performing method to several baselines and examine the influence of gain variations, different data sets, and reverberation. The output of our proposed method is compared to human performance. Finally, we give insights into the strategy used by our proposed method.
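
The step from a posterior distribution to a discrete point estimate can be illustrated with a minimal sketch. The abstract does not state which decision rule is used, so taking the most probable class (a maximum a posteriori choice) is an assumption here, and the function name `map_count` and the example probabilities are hypothetical, not taken from the paper.

```python
from typing import Sequence


def map_count(posterior: Sequence[float]) -> int:
    """Return the speaker count whose posterior probability is highest.

    posterior[k] is assumed to be the probability that k speakers are
    active, so the index of the largest entry is the point estimate.
    Ties resolve to the smallest count.
    """
    if not posterior:
        raise ValueError("posterior must not be empty")
    return max(range(len(posterior)), key=lambda k: posterior[k])


# Hypothetical posterior over 0..5 concurrent speakers (illustrative values only).
example_posterior = [0.02, 0.08, 0.15, 0.55, 0.15, 0.05]
estimate = map_count(example_posterior)
print(estimate)
```

Other decision rules, such as rounding the posterior mean, would fit the same interface; the argmax is used here only because it is the simplest to state.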

Details

Language :
English
ISSN :
2329-9290 and 2329-9304
Database :
OpenAIRE
Journal :
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27 (2), pp. 268-282, 2019
Accession number :
edsair.doi.dedup.....a59c4d96ba6d172edd4148548b24811b
Full Text :
https://doi.org/10.1109/TASLP.2018.2877892