Improved Convolutional Neural Network–Time-Delay Neural Network Structure with Repeated Feature Fusions for Speaker Verification

Authors :
Miaomiao Gao
Xiaojuan Zhang
Source :
Applied Sciences, Vol 14, Iss 8, p 3471 (2024)
Publication Year :
2024
Publisher :
MDPI AG, 2024.

Abstract

Advances in deep learning have greatly accelerated progress in speaker verification (SV). Studies show that both convolutional neural networks (CNNs) and dilated time-delay neural networks (TDNNs) achieve strong performance in text-independent SV, owing to their ability to extract local features and temporal contextual information, respectively. Combining the two has achieved better results still. However, we found a serious gridding effect when applying the 1D-Res2Net-based dilated TDNN proposed in ECAPA-TDNN to SV, which indicates discontinuity and local information loss in the frame-level features. To achieve high-resolution processing for speaker embedding, we improve the CNN–TDNN structure with the proposed repeated multi-scale feature fusions. The proposed structure effectively improves the channel utilization of the TDNN and achieves higher performance with the same number of TDNN channels. In addition, unlike previous studies, all of which convert CNN features to TDNN features directly, we study the latent space transformation between the CNN and the TDNN to achieve efficient conversion. Our best method obtains 0.72 EER and 0.0672 MinDCF on the VoxCeleb-O test set, and the proposed method performs better in cross-domain SV without additional parameters or computational complexity.
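
The gridding effect mentioned in the abstract can be illustrated with a small sketch. It is not the authors' code: it only assumes that stacked dilated 1D convolutions sharing one dilation rate sample the input on a sparse grid. The kernel size, the dilation values, and the helper names `receptive_offsets` and `missing_offsets` are hypothetical choices made for illustration.

```python
def receptive_offsets(kernel_size, dilations):
    """Frame offsets (relative to the output frame) reachable through a
    stack of dilated 1D convolutions with an odd kernel size."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        # Each layer taps its input at multiples of its dilation rate.
        taps = [j * d for j in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


def missing_offsets(kernel_size, dilations):
    """Offsets inside the receptive-field span that are never sampled."""
    reach = receptive_offsets(kernel_size, dilations)
    covered = set(reach)
    return [o for o in range(reach[0], reach[-1] + 1) if o not in covered]


if __name__ == "__main__":
    # Illustrative values only: two stacked layers, kernel 3, dilation 3.
    print(receptive_offsets(3, [3, 3]))  # [-6, -3, 0, 3, 6]
    print(missing_offsets(3, [3, 3]))    # [-5, -4, -2, -1, 1, 2, 4, 5]
    # Mixed dilations leave no holes in this toy case.
    print(missing_offsets(3, [1, 2, 4]))  # []
```

In this toy setting, the layers that share a dilation of 3 never see the frames immediately adjacent to the output frame, which is one way to picture the discontinuity and local information loss the abstract describes.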

Details

Language :
English
ISSN :
2076-3417
Volume :
14
Issue :
8
Database :
Directory of Open Access Journals
Journal :
Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.bbd5635c6744629abae44cf06875756
Document Type :
article
Full Text :
https://doi.org/10.3390/app14083471