Robust f0 extraction from monophonic signals using adaptive sub-band filtering

Authors :: M. Kiran Reddy; Pradeep Rengaswamy; Pallab Dasgupta; Krothapalli Sreenivasa Rao
Source :: Speech Communication. 116:77-85
Publication Year :: 2020
Publisher :: Elsevier BV
Abstract :: Fundamental frequency (f0) extraction plays an important role in the processing of monophonic signals such as speech and song. It is essential in various real-time applications such as emotion recognition and speech/singing voice discrimination. Several f0 extraction methods have been proposed over the years, but no single algorithm works well for both speech and song. In this paper, we propose a novel approach that can accurately estimate f0 from speech as well as songs. First, voiced/unvoiced detection is performed using a novel RNN-LSTM based approach. Then, each voiced frame is decomposed into several sub-bands. From each sub-band of a voiced frame, the candidate pitch periods are identified using autocorrelation and non-linear operations. Finally, Viterbi decoding is used to form the final pitch contours. The performance of the proposed method is evaluated using popular speech (Keele, CMU-ARCTIC) and song (MIR-1K, LYRICS) databases. The evaluation results show that the proposed method performs equally well for speech and monophonic songs, and outperforms state-of-the-art methods. Further, the efficacy of the proposed f0 extraction method is demonstrated by developing an interactive SARGAM learning tool.
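
The last two stages named in the abstract (candidate pitch periods from autocorrelation, then Viterbi decoding into a contour) can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the RNN-LSTM voiced/unvoiced detector, the adaptive sub-band decomposition, and the non-linear operations, and it assumes every frame passed in is voiced. The function names, the search range `fmin`/`fmax`, the candidate count `n_cand`, and the octave-distance transition cost with weight `jump_penalty` are all hypothetical choices made for illustration.

```python
import numpy as np

def autocorr_candidates(frame, sr, fmin=60.0, fmax=800.0, n_cand=3):
    """Return up to n_cand (f0_hz, strength) pairs, strongest first.

    Candidates are local maxima of the normalized autocorrelation
    inside the lag range implied by [fmin, fmax] (assumed defaults).
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Keep non-negative lags only; ac[0] is the frame energy.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return []  # silent frame: nothing to estimate
    ac = ac / ac[0]
    lo = max(1, int(sr / fmax))
    hi = min(int(sr / fmin), len(ac) - 2)
    lags = np.arange(lo, hi + 1)
    is_peak = (ac[lags] > ac[lags - 1]) & (ac[lags] >= ac[lags + 1])
    peak_lags = lags[is_peak]
    best = peak_lags[np.argsort(ac[peak_lags])[::-1][:n_cand]]
    return [(sr / lag, float(ac[lag])) for lag in best]

def viterbi_track(cands_per_frame, jump_penalty=2.0):
    """Pick one candidate per frame by minimum-cost path.

    cands_per_frame: list of non-empty lists of (f0_hz, strength).
    Local cost is -strength; transition cost is jump_penalty times
    the pitch jump in octaves (an assumed cost, not the paper's).
    """
    costs = [-np.array([s for _, s in cands_per_frame[0]])]
    back = []
    for t in range(1, len(cands_per_frame)):
        prev_f = np.array([f for f, _ in cands_per_frame[t - 1]])
        cur_f = np.array([f for f, _ in cands_per_frame[t]])
        cur_s = np.array([s for _, s in cands_per_frame[t]])
        trans = jump_penalty * np.abs(np.log2(cur_f[:, None] / prev_f[None, :]))
        total = costs[-1][None, :] + trans  # shape: (current, previous)
        best_prev = np.argmin(total, axis=1)
        back.append(best_prev)
        costs.append(total[np.arange(len(cur_f)), best_prev] - cur_s)
    # Backtrace from the cheapest final state.
    idx = int(np.argmin(costs[-1]))
    path = [idx]
    for bp in reversed(back):
        idx = int(bp[idx])
        path.append(idx)
    path.reverse()
    return np.array([cands_per_frame[t][i][0] for t, i in enumerate(path)])
```

In use, `autocorr_candidates` would be called once per voiced frame and the resulting lists passed to `viterbi_track`. The transition cost is what lets the decoder reject an isolated octave error even when that candidate has the highest autocorrelation strength in its own frame.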