Author: "Lee, Kong Aik" / Publication Type: Magazines - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Lee, Kong Aik"' showing total 14 results

1. Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Author: Liu, Tianchi, Lee, Kong Aik, Wang, Qiongqiong, and Li, Haizhou
Abstract: Residual neural networks (ResNet) demonstrate impressive performance in automatic speaker verification (ASV). They treat the time and frequency dimensions equally, following the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. We address this issue and postulate the Golden-Gemini Hypothesis, which posits the prioritization of temporal resolution over frequency resolution for ASV. The hypothesis is verified by conducting a systematic study on the impact of temporal and frequency resolutions on the performance, using a trellis diagram to represent the stride space. We further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based ASV models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on VoxCeleb, SITW, and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
Published: 2024
Full Text: View/download PDF

2. Cosine Scoring With Uncertainty for Neural Speaker Embedding

Author: Wang, Qiongqiong and Lee, Kong Aik
Abstract: Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While the conventional cosine-scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation. It achieved improvement with 8.5% and 9.8% average reductions in EER and minDCF compared to the conventional cosine similarity. It is also computationally efficient in practice.
Published: 2024
Full Text: View/download PDF
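
For orientation, the conventional cosine scoring that the abstract above uses as its point of comparison can be sketched as follows. This is a minimal illustration only, assuming embeddings are plain lists of floats; the function name `cosine_score` is hypothetical and the sketch does not implement the uncertainty propagation proposed in the paper.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two equal-length embedding vectors.

    Hypothetical helper for illustration; it is the plain baseline,
    with no uncertainty modeling.
    """
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    # Inner product of the two embeddings.
    dot = sum(a * b for a, b in zip(enroll, test))
    # Euclidean norms used to length-normalize the score.
    norm_e = math.sqrt(sum(a * a for a in enroll))
    norm_t = math.sqrt(sum(b * b for b in test))
    if norm_e == 0.0 or norm_t == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_e * norm_t)
```

A higher score suggests the two utterances are more likely to come from the same speaker; the accept/reject threshold is application-dependent and is not specified here.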

3. t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Author: Kinnunen, Tomi H., Lee, Kong Aik, Tak, Hemlata, Evans, Nicholas, and Nautsch, Andreas
Abstract: Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a set of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the concurrent t-EER, a unique operating point which is invariable to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.
Published: 2024
Full Text: View/download PDF

4. Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Author: Liu, Xuechen, Sahidullah, Md, Lee, Kong Aik, and Kinnunen, Tomi
Abstract: It is now well-known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to counteract ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input either as a bonafide, or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization efforts at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort imposter (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protections and more economic computations. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.
Published: 2024
Full Text: View/download PDF

5. Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Author: Tao, Ruijie, Lee, Kong Aik, Das, Rohan Kumar, Hautamaki, Ville, and Li, Haizhou
Abstract: We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the contrastive learning of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common that we sample a positive pair of segments from the same utterance. Unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and an EER of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms the state-of-the-art self-supervised learning methods by a large margin and, at the same time, achieves results comparable with its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
Published: 2023
Full Text: View/download PDF

6. Meta-Generalization for Domain-Invariant Speaker Verification

Author: Zhang, Hanyi, Wang, Longbiao, Lee, Kong Aik, Liu, Meng, Dang, Jianwu, and Meng, Helen
Abstract: Automatic speaker verification (ASV) exhibits unsatisfactory performance under domain mismatch conditions owing to intrinsic and extrinsic factors, such as variations in speaking styles and recording devices encountered in real-world applications. To ensure robust performance under unseen conditions, domain generalization has been explored. However, an inherent contradiction exists between model discrimination and domain generalization, in which the discrimination ability may be reduced while learning to generalize. In this paper, to extract discriminative yet domain-invariant representations, we propose the meta-generalized speaker verification (MGSV) via meta-learning. Specifically, we propose a metric-based distribution optimization and a gradient-based meta-optimization to simultaneously supervise the spatial relationship between embeddings and improve the generalization ability of the model on unseen domains. In addition, we design multiple-single (MS) and simulated speaker verification (SSV) sampling strategies based on single-domain (SD) and single-single (SS) strategies to simulate the train/test domain mismatch more relevantly, thereby mining transferable speaker-related knowledge. SSV is chosen as the most effective method, as it substantially improves the domain generalization by ensuring that the model has learned to discriminate efficiently. Additionally, to intuitively reflect the model performance on the unseen domains, the proposed method is validated on cross-genre, cross-device, and cross-dataset tasks. The experimental results demonstrate that our proposed method achieves remarkable performance in handling domain mismatch issues in speaker verification.
Published: 2023
Full Text: View/download PDF

7. ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

Author: Liu, Xuechen, Wang, Xin, Sahidullah, Md, Patino, Jose, Delgado, Hector, Kinnunen, Tomi, Todisco, Massimiliano, Yamagishi, Junichi, Evans, Nicholas, Nautsch, Andreas, and Lee, Kong Aik
Abstract: Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions towards those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This article provides a summary of the ASVspoof 2021 challenge and the results of 54 participating teams that submitted to the evaluation phase. For the logical access (LA) task, results indicate that countermeasures are robust to newly introduced encoding and transmission effects. Results for the physical access (PA) task indicate the potential to detect replay attacks in real, as opposed to simulated physical spaces, but a lack of robustness to variations between simulated and real acoustic environments. The Deepfake (DF) task, new to the 2021 edition, targets solutions to the detection of manipulated, compressed speech data posted online. While detection solutions offer some resilience to compression effects, they lack generalization across different source datasets. In addition to a summary of the top-performing systems for each task, new analyses of influential data factors and results for hidden data subsets, the article includes a review of post-challenge results, an outline of the principal challenge limitations and a road-map for the future of ASVspoof.
Published: 2023
Full Text: View/download PDF

8. Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals

Author: Kinnunen, Tomi, Delgado, Hector, Evans, Nicholas, Lee, Kong Aik, Vestman, Ville, Nautsch, Andreas, Todisco, Massimiliano, Wang, Xin, Sahidullah, Md, Yamagishi, Junichi, and Reynolds, Douglas A.
Abstract: Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs. The reliability of spoofing CMs is typically gauged using the equal error rate (EER) metric. The primitive EER fails to reflect application requirements and the impact of spoofing and CMs upon ASV, and its use as a primary metric in traditional ASV research has long been abandoned in favour of risk-based approaches to assessment. This paper presents several new extensions to the tandem detection cost function (t-DCF), a recent risk-based approach to assess the reliability of spoofing CMs deployed in tandem with an ASV system. Extensions include a simplified version of the t-DCF with fewer parameters, an analysis of a special case for a fixed ASV system, simulations which give original insights into its interpretation and new analyses using the ASVspoof 2019 database. It is hoped that adoption of the t-DCF for the CM assessment will help to foster closer collaboration between the anti-spoofing and ASV research communities.
Published: 2020
Full Text: View/download PDF
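
As background to the abstract above, the plain equal error rate (EER) can be approximated with a simple threshold sweep. The sketch below is illustrative only: the function name `equal_error_rate` and the accept-if-score-is-at-or-above-threshold convention are assumptions, and it does not implement the t-DCF or any of the paper's extensions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Hypothetical helper: a trial is accepted when score >= threshold.
    Returns the mean of miss and false-alarm rates at the threshold
    where the two rates are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a target trial scored below the threshold (rejected).
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer
```

With few trials the two error rates may never coincide exactly, so the sketch reports the midpoint at the closest threshold; evaluation toolkits often interpolate along the error curve instead.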

9. Maximal Figure-of-Merit Framework to Detect Multi-Label Phonetic Features for Spoken Language Recognition

Author: Kukanov, Ivan, Trong, Trung Ngo, Hautamaki, Ville, Siniscalchi, Sabato Marco, Salerno, Valerio Mario, and Lee, Kong Aik
Abstract: Bottleneck features (BNFs) generated with a deep neural network (DNN) have proven to boost spoken language recognition accuracy over basic spectral features significantly. However, BNFs are commonly extracted using language-dependent tied-context phone states as learning targets. Moreover, BNFs are less phonetically expressive than the output layer in a DNN, which is usually not used as a speech feature because of its very high dimensionality hindering further post-processing. In this article, we put forth a novel deep learning framework to overcome all of the above issues and evaluate it on the 2017 NIST Language Recognition Evaluation (LRE) challenge. We use manner and place of articulation as speech attributes, which lead to low-dimensional “universal” phonetic features that can be defined across all spoken languages. To model the asynchronous nature of the speech attributes while capturing their intrinsic relationships in a given speech segment, we introduce a new training scheme for deep architectures based on a Maximal Figure of Merit (MFoM) objective. MFoM introduces non-differentiable metrics into the backpropagation-based approach, which is elegantly solved in the proposed framework. The experimental evidence collected on the recent NIST LRE 2017 challenge demonstrates the effectiveness of our solution. In fact, the performance of spoken language recognition (SLR) systems based on spectral features is improved by more than 5% absolute Cavg. Finally, the F1 metric can be brought from 77.6% up to 78.1% by combining the conventional baseline phonetic BNFs with the proposed articulatory attribute features.
Published: 2020
Full Text: View/download PDF

10. Generalizing I-Vector Estimation for Rapid Speaker Recognition

Author: Xu, Longting, Lee, Kong Aik, Li, Haizhou, and Yang, Zhen
Abstract: An i-vector is a compact representation that captures both the speaker and session variabilities rendered in a spoken utterance. Over the past years, it has prevailed over other techniques and is now the de facto representation for text-independent speaker recognition. Standard i-vector extraction requires intense computation at run-time. Reducing the computation will allow effective use of i-vector in more applications. Such intense computation arises from the posterior covariance matrix, when estimating the i-vector. There have been studies on how to simplify the computation of posterior covariance matrix with modest success. In this paper, we propose a novel approach to i-vector extraction without the need to evaluate the full posterior covariance thereby speeding up the run-time extraction process. This is achieved by generalizing the i-vector estimation in two ways. First, we introduce the use of occupancy reweighting in conjunction with whitening over the Baum-Welch statistics as part of the preprocessing step. Second, we introduce the so-called subspace-orthogonalizing prior (SOP) to replace the standard Gaussian prior in i-vector formulation. Experiments conducted on the extended-core task of NIST SRE'10 show that the proposed rapid SOP approach achieves considerable speed-up over the standard i-vector with comparable equal error rates.
Published: 2018
Full Text: View/download PDF

11. Direct Optimization of the Detection Cost for I-Vector-Based Spoken Language Recognition

Author: Sizov, Aleksandr, Lee, Kong Aik, and Kinnunen, Tomi
Abstract: We explore a method to boost the discriminative capabilities of the probabilistic linear discriminant analysis (PLDA) model without losing its generative advantages. We show sequential projection and training steps leading to a classifier that operates in the original i-vector space but is discriminatively trained in a low-dimensional PLDA latent subspace. We use the extended Baum–Welch technique to optimize the model with respect to two objective functions for discriminative training. One of them is the well-known maximum mutual information objective, while the other one is a new objective that we propose to approximate the language detection cost. We evaluate the performance on NIST language recognition evaluation (LRE) 2015 and our development dataset comprised of the utterances from previous LREs. We improve the detection cost by 10% and 6% relative compared to our fine-tuned generative and discriminative baselines, and by 10% over the best of our previously reported results. The proposed approximation method of the cost function and PLDA subspace training are applicable for a broad range of tasks.
Published: 2017
Full Text: View/download PDF

12. A Dual Latent Variable Personalized Dialogue Agent

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon Seng
Abstract: Personalized dialogue agents are capable of generating responses consistent with a specific persona. Typically, personalized dialogue agents generate responses based on both the dialogue history and a representation of the agent’s desired persona. As it is impractical to obtain the persona representations for every interlocutor in real-world implementations, recent works have explored the possibility of generating personalized dialogue by finetuning the agent with dialogue examples corresponding to a given persona instead. However, in real-world implementations, a sufficient number of corresponding dialogue examples are also rarely available. Hence, in this paper, we introduce the Dual Latent Variable Generator (DLVGen), a variational personalized dialogue agent capable of generating personalized dialogue without any persona information or any corresponding dialogue examples. Unlike previous works, DLVGen models the latent distribution over potential dialogue response intents as well as the latent distribution over the agent’s potential persona. During inference, latent variables are sampled from both distributions and fed to the decoder. Extensive experiments on the popular ConvAI2 personalized dialogue corpus show that DLVGen is capable of generating natural, persona consistent responses. Additionally, we also introduce a variance regularization and response selection approach which further improved overall response quality.
Published: 2023
Full Text: View/download PDF

13. Total Variability Modeling Using Source-Specific Priors

Author: Shepstone, Sven Ewan, Lee, Kong Aik, Li, Haizhou, Tan, Zheng-Hua, and Jensen, Soren Holdt
Abstract: In total variability modeling, variable length speech utterances are mapped to fixed low-dimensional i-vectors. Central to computing the total variability matrix and i-vector extraction is the computation of the posterior distribution for a latent variable conditioned on an observed feature sequence of an utterance. In both cases the prior for the latent variable is assumed to be non-informative, since for homogeneous datasets there is no gain in generality in using an informative prior. This work shows that, in the heterogeneous case, using informative priors for computing the posterior can lead to favorable results. We focus on modeling the priors using minimum divergence criterion or factor analysis techniques. Tests on the NIST 2008 and 2010 Speaker Recognition Evaluation (SRE) dataset show that our proposed method beats four baselines: for i-vector extraction using an already trained matrix, for the short2-short3 task in SRE'08, five out of eight female and four out of eight male common conditions were improved. For the core-extended task in SRE'10, four out of nine female and six out of nine male common conditions were improved. When incorporating prior information into the training of the T matrix itself, the proposed method beats the baselines for six out of eight female and five out of eight male common conditions, for SRE'08, and five and six out of nine conditions, for the male and female case, respectively, for SRE'10. Tests using factor analysis for estimating priors show that two priors do not offer much improvement, but in the case of three separate priors (sparse data), considerable improvements were gained.
Published: 2016
Full Text: View/download PDF

14. ASVtorch toolkit: Speaker verification with deep neural networks

Author: Lee, Kong Aik, Vestman, Ville, and Kinnunen, Tomi
Abstract: The human voice differs substantially between individuals. This facilitates automatic speaker verification (ASV), that is, recognizing a person from his/her voice. ASV accuracy has substantially increased throughout the past decade due to recent advances in machine learning, particularly deep learning methods. An unfortunate downside has been substantially increased complexity of ASV systems. To help non-experts to kick-start reproducible ASV development, a state-of-the-art toolkit implementing various ASV pipelines and functionalities is required. To this end, we introduce a new open-source toolkit, ASVtorch, implemented in Python using the widely used PyTorch machine learning framework.
Published: 2021
Full Text: View/download PDF
