PLDA inspired Siamese networks for speaker verification.

Authors :: Ramoji, Shreyas; Krishnan, Prashant; Ganapathy, Sriram
Source :: Computer Speech & Language. Nov2022, Vol. 76, pN.PAG-N.PAG. 1p.
Publication Year :: 2022
Abstract: The deep learning methodologies in state-of-the-art speaker recognition systems are predominantly limited to the extraction of recording-level embeddings. This is usually followed by generative modeling of the embeddings to output the verification score. In this paper, we explore a fully neural approach where the neural model outputs the verification score directly, given the acoustic feature inputs. This model, termed the Siamese neural network (SiamNN), combines the embedding extraction and back-end modeling into a single processing pipeline. The back-end modeling is achieved using a neural approach to PLDA modeling, called neural probabilistic linear discriminant analysis (NPLDA). In the NPLDA model, the verification score is computed as a discriminative similarity function. The development of the single neural SiamNN model allows the joint optimization of all the modules using a verification cost. Several speaker recognition experiments are performed using the SITW, VOiCES, and NIST SRE datasets, where the proposed SiamNN model is shown to significantly improve over the state-of-the-art x-vector PLDA baseline system (relative improvements of up to 35% in the primary cost metric). We also provide a detailed analysis of the influence of hyper-parameters, choice of loss functions, and data sampling strategies for training the model. In particular, we highlight that optimization based on the proposed soft detection cost function improves over the other loss functions considered. [ABSTRACT FROM AUTHOR]
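
The abstract names two ingredients without giving their form: a discriminative similarity function for the verification score and a soft detection cost used as the training loss. The sketch below is one plausible reading, not the authors' implementation. It assumes the score is a PLDA-style quadratic form in the enrollment and test embeddings, and that the soft cost replaces the hard miss and false-alarm counts with sigmoids. The names `pair_score` and `soft_detection_cost`, the matrices `P` and `Q`, and the parameters `threshold`, `alpha`, and `beta` are all hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pair_score(e, t, P, Q):
    # Assumed quadratic similarity between an enrollment embedding e and a
    # test embedding t: e'Qt + e'Pe + t'Pt. P and Q would be learnable
    # parameters in a trained model; here they are plain nested lists.
    def quad(a, M, b):
        return sum(a[i] * M[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(e, Q, t) + quad(e, P, e) + quad(t, P, t)


def soft_detection_cost(scores, labels, threshold=0.0, alpha=10.0, beta=1.0):
    # Assumed smooth surrogate of a normalized detection cost.
    # labels: 1 for target trials, 0 for non-target trials.
    # alpha controls how closely the sigmoid approximates a hard step,
    # beta weights false alarms relative to misses.
    tgt = [s for s, y in zip(scores, labels) if y == 1]
    non = [s for s, y in zip(scores, labels) if y == 0]
    # Soft miss rate: target trials scored below the threshold.
    p_miss = sum(1.0 - sigmoid(alpha * (s - threshold)) for s in tgt) / len(tgt)
    # Soft false-alarm rate: non-target trials scored above the threshold.
    p_fa = sum(sigmoid(alpha * (s - threshold)) for s in non) / len(non)
    return p_miss + beta * p_fa
```

Because the sigmoid is differentiable, a cost of this shape could in principle be minimized by gradient descent through both the scoring function and the embedding extractor, which is the kind of joint optimization the abstract describes. With well-separated scores the cost approaches 0, and with fully inverted scores it approaches `1 + beta`.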