Author: "Lee, Kong Aik" - Searchworks@Jio Institute Digital Library Search Results

1. LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation

Author: Luong, Hieu-Thi, Li, Haoyang, Zhang, Lin, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Previous fake speech datasets were constructed from a defender's perspective to develop countermeasure (CM) systems without considering diverse motivations of attackers. To better align with real-life scenarios, we created LlamaPartialSpoof, a 130-hour dataset containing both fully and partially fake speech, using a large language model (LLM) and voice cloning technologies to evaluate the robustness of CMs. By examining information valuable to both attackers and defenders, we identify several key vulnerabilities in current CM systems, which can be exploited to enhance attack success rates, including biases toward certain text-to-speech models or concatenation methods. Our experimental results indicate that current fake speech detection systems struggle to generalize to unseen scenarios, achieving a best performance of 24.44% equal error rate., Comment: 5 pages, submitted to ICASSP 2025
Published: 2024
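
Note on the metric: the equal error rate (EER) reported in this and several later abstracts is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one way to estimate it from detector scores, assuming that higher scores indicate bona fide speech. The function name `compute_eer` and the toy scores are illustrative assumptions and are not taken from any of the listed papers.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Hypothetical scores: perfectly separated classes give an EER of zero.
eer_separated = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

Evaluation toolkits usually interpolate between thresholds; this sketch only picks the closest observed threshold.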

2. Room Impulse Responses help attackers to evade Deep Fake Detection

Author: Luong, Hieu-Thi, Truong, Duc-Tuan, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The ASVspoof 2021 benchmark, a widely-used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness of utilizing room impulse responses (RIRs) to enhance fake speech and increase their likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, we augmented training data with a large-scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing DF task EER to 2.13%., Comment: 7 pages, to be presented at SLT 2024
Published: 2024
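
Note on the technique: applying a room impulse response to a waveform amounts to convolving the two signals. The sketch below illustrates that operation only. The function name `apply_rir`, the peak rescaling step, and the synthetic decaying-noise RIR are assumptions made for illustration and do not reproduce the paper's pipeline or its RIR dataset.

```python
import numpy as np

def apply_rir(speech, rir):
    """Convolve a waveform with a room impulse response, keeping the input length."""
    speech = np.asarray(speech, dtype=float)
    rir = np.asarray(rir, dtype=float)
    reverberant = np.convolve(speech, rir)[: len(speech)]
    # Rescale to the original peak so the RIR does not change the overall level.
    peak = np.max(np.abs(reverberant))
    if peak > 0:
        reverberant = reverberant * (np.max(np.abs(speech)) / peak)
    return reverberant

# Toy stand-ins: white noise for speech, exponentially decaying noise for the RIR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
reverberant = apply_rir(speech, rir)
```

The same function could serve for attack simulation or for training-data augmentation, since both reduce to reverberating existing audio.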

3. On the effectiveness of enrollment speech augmentation for Target Speaker Extraction

Author: Li, Junjie, Zhang, Ke, Wang, Shuai, Li, Haizhou, Mak, Man-Wai, and Lee, Kong Aik
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning technologies have significantly advanced the performance of target speaker extraction (TSE) tasks. To enhance the generalization and robustness of these algorithms when training data is insufficient, data augmentation is a commonly adopted technique. Unlike typical data augmentation applied to speech mixtures, this work thoroughly investigates the effectiveness of augmenting the enrollment speech space. We found that for both pretrained and jointly optimized speaker encoders, directly augmenting the enrollment speech leads to consistent performance improvement. In addition to conventional methods such as noise and reverberation addition, we propose a novel augmentation method called self-estimated speech augmentation (SSA). Experimental results on the Libri2Mix test set show that our proposed method can achieve an improvement of up to 2.5 dB., Comment: Accepted by SLT2024
Published: 2024

4. Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

Author: Liu, Tianchi, Kukanov, Ivan, Pan, Zihan, Wang, Qiongqiong, Sailor, Hardik B., and Lee, Kong Aik
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Language mismatch affects speech anti-spoofing systems, yet investigation and quantification of its effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and reduced by over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios., Comment: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024. Copyright may be transferred without notice, after which this version may no longer be accessible
Published: 2024

5. NPU-NTU System for Voice Privacy 2024 Challenge

Author: Yao, Jixun, Kuzmin, Nikita, Wang, Qing, Guo, Pengcheng, Ning, Ziqian, Guo, Dake, Lee, Kong Aik, Chng, Eng-Siong, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper, we describe our proposed speaker anonymization system for VPC 2024. Our system employs a disentangled neural codec architecture and a serial disentanglement strategy to gradually disentangle the global speaker identity and time-variant linguistic content and paralinguistic information. We introduce multiple distillation methods to disentangle linguistic content, speaker identity, and emotion. These methods include semantic distillation, supervised speaker distillation, and frame-level emotion distillation. Based on these distillations, we anonymize the original speaker identity using a weighted sum of a set of candidate speaker identities and a randomly generated speaker identity. Our system achieves the best trade-off of privacy protection and emotion preservation in VPC 2024., Comment: System description for VPC 2024
Published: 2024

6. Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model

Author: Todisco, Massimiliano, Panariello, Michele, Wang, Xin, Delgado, Héctor, Lee, Kong Aik, and Evans, Nicholas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We present Malacopula, a neural-based generalised Hammerstein model designed to introduce adversarial perturbations to spoofed speech utterances so that they better deceive automatic speaker verification (ASV) systems. Using non-linear processes to modify speech utterances, Malacopula enhances the effectiveness of spoofing attacks. The model comprises parallel branches of polynomial functions followed by linear time-invariant filters. The adversarial optimisation procedure acts to minimise the cosine distance between speaker embeddings extracted from spoofed and bona fide utterances. Experiments, performed using three recent ASV systems and the ASVspoof 2019 dataset, show that Malacopula increases vulnerabilities by a substantial margin. However, speech quality is reduced and attacks can be detected effectively under controlled conditions. The findings emphasise the need to identify new vulnerabilities and design defences to protect ASV systems from adversarial attacks in the wild., Comment: Accepted at ASVspoof Workshop 2024
Published: 2024
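
Note on the model structure: the abstract describes parallel branches of polynomial functions, each followed by a linear time-invariant filter. The sketch below shows that signal flow in its textbook form, assuming that branch k raises the input to the k-th power and that each filter is a short FIR filter with fixed coefficients. In the paper the filters are learned through adversarial optimisation, which this sketch does not attempt. The function name `generalised_hammerstein` is illustrative.

```python
import numpy as np

def generalised_hammerstein(x, filters):
    """Sum of K branches: branch k computes x**k, then applies FIR filter k."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, h in enumerate(filters, start=1):
        # Static polynomial non-linearity followed by a linear filter.
        branch = np.convolve(x ** k, np.asarray(h, dtype=float))[: len(x)]
        y += branch
    return y

# Hypothetical two-branch example: y = x + 0.5 * x**2 (single-tap filters).
y = generalised_hammerstein([1.0, 2.0, 3.0], [[1.0], [0.5]])
```

With a single branch and a one-tap unit filter the model reduces to the identity, which makes a convenient sanity check.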

7. ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Author: Wang, Xin, Delgado, Hector, Tak, Hemlata, Jung, Jee-weon, Shim, Hye-jin, Todisco, Massimiliano, Kukanov, Ivan, Liu, Xuechen, Sahidullah, Md, Kinnunen, Tomi, Evans, Nicholas, Lee, Kong Aik, and Yamagishi, Junichi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements., Comment: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)
Published: 2024

8. Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning

Author: Wang, Shuai, Chen, Zhengyang, Lee, Kong Aik, Qian, Yanmin, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speaker individuality information is among the most critical elements within speech signals. When modeled thoroughly and accurately, this information can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this article, we aim to present, from a unique perspective, the developmental history, paradigm shifts, and application domains of speaker modeling technologies within the deep representation learning framework. This review is designed to provide a clear reference for researchers in the speaker modeling field, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.
Published: 2024

9. Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Author: Truong, Duc-Tuan, Tao, Ruijie, Nguyen, Tuan, Luong, Hieu-Thi, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. Further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech., Comment: Accepted by INTERSPEECH 2024
Published: 2024

10. Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

Author: Wang, Xin, Kinnunen, Tomi, Lee, Kong Aik, Noé, Paul-Gauthier, and Yamagishi, Junichi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Fusing outputs from automatic speaker verification (ASV) and spoofing countermeasure (CM) is expected to make an integrated system robust to zero-effort imposters and synthesized spoofing attacks. Many score-level fusion methods have been proposed, but many remain heuristic. This paper revisits score-level fusion using tools from decision theory and presents three main findings. First, fusion by summing the ASV and CM scores can be interpreted on the basis of compositional data analysis, and score calibration before fusion is essential. Second, the interpretation leads to an improved fusion method that linearly combines the log-likelihood ratios of ASV and CM. However, as the third finding reveals, this linear combination is inferior to a non-linear one in making optimal decisions. The outcomes of these findings, namely, the score calibration before fusion, improved linear fusion, and better non-linear fusion, were found to be effective on the SASV challenge database., Comment: Proceedings of Interspeech, DOI: 10.21437/Interspeech.2024-422. Code: https://github.com/nii-yamagishilab/SpeechSPC-mini
Published: 2024
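
Note on the fusion step: the abstract's second finding is a fusion that linearly combines the log-likelihood ratios (LLRs) of the ASV and CM subsystems. The sketch below shows only the generic form of such a combination. The weights, the bias, the zero decision threshold, and the function name `linear_llr_fusion` are hypothetical placeholders; the paper's calibrated weights and its non-linear alternative are not reproduced here.

```python
import numpy as np

def linear_llr_fusion(llr_asv, llr_cm, w_asv=0.5, w_cm=0.5, bias=0.0):
    """Weighted sum of calibrated ASV and CM log-likelihood ratios."""
    llr_asv = np.asarray(llr_asv, dtype=float)
    llr_cm = np.asarray(llr_cm, dtype=float)
    return w_asv * llr_asv + w_cm * llr_cm + bias

# Three hypothetical trials: target bona fide, target spoofed, non-target bona fide.
fused = linear_llr_fusion([4.0, 4.0, -4.0], [2.0, -6.0, 2.0])
accept = fused > 0.0
```

In this toy case only the first trial is accepted, because a strongly negative LLR from either subsystem pulls the fused score below the threshold.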

11. Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding

Author: Wang, Rui, Chen, Liping, Lee, Kong Aik, and Ling, Zhen-Hua
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice anonymization has been developed as a technique for preserving privacy by replacing the speaker's voice in a speech signal with that of a pseudo-speaker, thereby obscuring the original voice attributes from machine recognition and human perception. In this paper, we focus on altering the voice attributes against machine recognition while retaining human perception. We refer to this as asynchronous voice anonymization. To this end, a speech generation framework incorporating a speaker disentanglement mechanism is employed to generate the anonymized speech. The speaker attributes are altered through adversarial perturbation applied on the speaker embedding, while human perception is preserved by controlling the intensity of perturbation. Experiments conducted on the LibriSpeech dataset showed that the speaker attributes were obscured with their human perception preserved for 60.71% of the processed utterances., Comment: Accepted by Interspeech 2024
Published: 2024

12. Cosine Scoring with Uncertainty for Neural Speaker Embedding

Author: Wang, Qiongqiong and Lee, Kong Aik
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While the conventional cosine-scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation. It achieved improvement with 8.5% and 9.8% average reductions in EER and minDCF compared to the conventional cosine similarity. It is also computationally efficient in practice., Comment: 5 pages, 4 figures
Published: 2024

13. VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis

Author: Lin, Weiwei, He, Chenhang, Mak, Man-Wai, Lian, Jiachen, and Lee, Kong Aik
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Achieving nuanced and accurate emulation of human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, the mainstream of speech synthesis models still relies on supervised speaker modeling and explicit reference utterances. However, there are many aspects of human voice, such as emotion, intonation, and speaking style, for which it is hard to obtain accurate labels. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During inference, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, the exploration of latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender attributes, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than the previous approaches. We also show that latent space manipulation produces consistent and human-identifiable effects that are not detrimental to the speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: https://bit.ly/VoxGenesis., Comment: preprint
Published: 2024

14. Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Author: Liu, Xuechen, Sahidullah, Md, Lee, Kong Aik, and Kinnunen, Tomi
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: It is now well known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to protect ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input either as a bonafide or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization efforts at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort imposter (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protections and more economic computations. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively., Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)
Published: 2024

15. Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio

Author: Ma, Yi, Lee, Kong Aik, Hautamäki, Ville, Ge, Meng, and Li, Haizhou
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR) under 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affect speaker verification. We propose a mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism is based on the property that the gradient indicates which parts of the input the model is paying attention to. Specifically, when the speaker network focuses on a region in the denoised utterance but not on the clean counterpart, we consider it artifact noise and assign higher weights for this region during optimization of enhancement. We validate it by training an enhancement model and testing the enhanced utterance on speaker verification. The experimental results show that our approach effectively reduces artifact noise, improving speaker verification across various SNR levels., Comment: Accepted by ICASSP 2024
Published: 2024

16. Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Author: Liu, Tianchi, Lee, Kong Aik, Wang, Qiongqiong, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, and conduct a systematic study on the impact of temporal and frequency resolutions on the performance and further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on VoxCeleb, SITW, and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model., Comment: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. Open Access: https://ieeexplore.ieee.org/abstract/document/10497864
Published: 2023

17. An Empirical Bayes Framework for Open-Domain Dialogue Generation

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon-Seng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: To engage human users in meaningful conversation, open-domain dialogue agents are required to generate diverse and contextually coherent dialogue. Despite recent advancements, which can be attributed to the usage of pretrained language models, the generation of diverse and coherent dialogue remains an open research problem. A popular approach to address this issue involves the adaptation of variational frameworks. However, while these approaches successfully improve diversity, they tend to compromise on contextual coherence. Hence, we propose the Bayesian Open-domain Dialogue with Empirical Bayes (BODEB) framework, an empirical Bayes framework for constructing a Bayesian open-domain dialogue agent by leveraging pretrained parameters to inform the prior and posterior parameter distributions. Empirical results show that BODEB achieves better results in terms of both diversity and coherence compared to variational frameworks.
Published: 2023

18. Partially Randomizing Transformer Weights for Dialogue Response Diversity

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon-Seng
Subjects: Computer Science - Computation and Language
Abstract: Despite recent progress in generative open-domain dialogue, the issue of low response diversity persists. Prior works have addressed this issue via either novel objective functions, alternative learning approaches such as variational frameworks, or architectural extensions such as the Randomized Link (RL) Transformer. However, these approaches typically entail either additional difficulties during training/inference, or a significant increase in model size and complexity. Hence, we propose the Partially Randomized transFormer (PaRaFormer), a simple extension of the transformer which involves freezing the weights of selected layers after random initialization. Experimental results reveal that the performance of the PaRaFormer is comparable to that of the aforementioned approaches, despite not entailing any additional training difficulty or increase in model complexity.
Published: 2023

19. Disentangling Voice and Content with Self-Supervision for Speaker Recognition

Author: Liu, Tianchi, Lee, Kong Aik, Wang, Qiongqiong, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor data is specifically needed, it is easily applicable in practical use., Comment: Accepted to NeurIPS 2023 (main track)
Published: 2023

20. Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

Author: Truong, Duc-Tuan, Tao, Ruijie, Yip, Jia Qi, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods., Comment: Accepted by ICASSP 2024
Published: 2023

21. The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR

Author: Liang, Yuhao, Shi, Mohan, Yu, Fan, Li, Yangze, Zhang, Shiliang, Du, Zhihao, Chen, Qian, Xie, Lei, Qian, Yanmin, Wu, Jian, Chen, Zhuo, Lee, Kong Aik, Yan, Zhijie, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0), held at ASRU2023, aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" in typical meeting scenarios. We established two sub-tracks: the fixed training condition sub-track, where the training data is constrained to predetermined datasets but participants can use any open-source pre-trained model, and the open training condition sub-track, which allows the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR., Comment: 8 pages, Accepted by ASRU2023
Published: 2023

22. t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Author: Kinnunen, Tomi, Lee, Kong Aik, Tak, Hemlata, Evans, Nicholas, and Nautsch, Andreas
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing, Statistics - Computation
Abstract: Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a set of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the concurrent t-EER, a unique operating point which is invariant to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators., Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence. For associated codes, see https://github.com/TakHemlata/T-EER (Github) and https://colab.research.google.com/drive/1ga7eiKFP11wOFMuZjThLJlkBcwEG6_4m?usp=sharing (Google Colab)
Published: 2023

23. Towards single integrated spoofing-aware speaker verification embeddings

Author: Mun, Sung Hwan, Shim, Hye-jin, Tak, Hemlata, Wang, Xin, Liu, Xuechen, Sahidullah, Md, Jeong, Myeonghun, Han, Min Hyun, Todisco, Massimiliano, Lee, Kong Aik, Yamagishi, Junichi, Evans, Nicholas, Kinnunen, Tomi, Kim, Nam Soo, and Jung, Jee-weon
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two requirements. First, they should reject non-target speakers' inputs as well as target speakers' spoofed inputs. Second, they should demonstrate competitive performance compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. Our analysis suggests that the inferior performance of single SASV embeddings comes from an insufficient amount of training data and the distinct nature of the ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge., Comment: Accepted by INTERSPEECH 2023. Code and models are available at https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline
Published: 2023

24. Generalized domain adaptation framework for parametric back-end in speaker recognition

Author: Wang, Qiongqiong, Okabe, Koji, Lee, Kong Aik, and Koshinaka, Takafumi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: State-of-the-art speaker recognition systems comprise a speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) back-end. The effectiveness of these components relies on the availability of a large amount of labeled training data. In practice, it is common for the domains (e.g., language, channel, demographic) in which a system is deployed to differ from those in which it has been trained. To close the resulting gap, domain adaptation is often essential for PLDA models. Two variants of PLDA are Heavy-tailed PLDA (HT-PLDA) and Gaussian PLDA (G-PLDA). Though the former better fits real feature spaces than does the latter, its popularity has been severely limited by its computational complexity and, especially, by the difficulty it presents in domain adaptation, which results from its non-Gaussian property. Various domain adaptation methods have been proposed for G-PLDA. This paper proposes a generalized framework for domain adaptation that can be applied to both of the above variants of PLDA for speaker recognition. It not only includes several existing supervised and unsupervised domain adaptation methods but also makes possible more flexible usage of available data in different domains. In particular, we introduce here two new techniques: (1) correlation-alignment in the model level, and (2) covariance regularization. To the best of our knowledge, this is the first proposed application of such techniques for domain adaptation w.r.t. HT-PLDA. The efficacy of the proposed techniques has been experimentally validated on NIST 2016, 2018, and 2019 Speaker Recognition Evaluation (SRE'16, SRE'18, and SRE'19) datasets.
Published: 2023

25. Speaker-Aware Anti-Spoofing

Author: Liu, Xuechen, Sahidullah, Md, Lee, Kong Aik, and Kinnunen, Tomi
Subjects: Computer Science - Sound, Computer Science - Cryptography and Security, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We address speaker-aware anti-spoofing, where prior knowledge of the target speaker is incorporated into a voice spoofing countermeasure (CM). In contrast to the frequently used speaker-independent solutions, we train the CM in a speaker-conditioned way. As a proof of concept, we consider a speaker-aware extension to the state-of-the-art AASIST (audio anti-spoofing using integrated spectro-temporal graph attention networks) model. To this end, we consider two alternative strategies to incorporate target speaker information at the frame and utterance levels, respectively. The experimental results on a custom protocol based on the ASVspoof 2019 dataset indicate the effectiveness of speaker information obtained via enrollment: we obtain maximum relative improvements of 25.1% and 11.6% in equal error rate (EER) and minimum tandem detection cost function (t-DCF) over a speaker-independent baseline, respectively.
Published: 2023

26. Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Author: Liu, Meng, Lee, Kong Aik, Wang, Longbiao, Zhang, Hanyi, Zeng, Chang, and Dang, Jianwu
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.
Published: 2023

27. Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification

Author: Wang, Qiongqiong, Lee, Kong Aik, and Liu, Tianchi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speech utterances recorded under differing conditions exhibit varying degrees of confidence in their embedding estimates, i.e., uncertainty, even if they are extracted using the same neural network. This paper aims to incorporate the uncertainty estimate produced in the xi-vector network front-end with a probabilistic linear discriminant analysis (PLDA) back-end scoring for speaker verification. To achieve this we derive a posterior covariance matrix, which measures the uncertainty, from the frame-wise precisions to the embedding space. We propose a log-likelihood ratio function for the PLDA scoring with the uncertainty propagation. We also propose to replace the length normalization pre-processing technique with a length scaling technique for the application of uncertainty propagation in the back-end. Experimental results on the VoxCeleb-1, SITW test sets as well as a domain-mismatched CNCeleb1-E set show the effectiveness of the proposed techniques with 14.5%-41.3% EER reductions and 4.6%-25.3% minDCF reductions., Comment: Accepted in ICASSP 2023 conference
Published: 2023

28. Probabilistic Back-ends for Online Speaker Recognition and Clustering

Author: Sholokhov, Alexey, Kuzmin, Nikita, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: This paper focuses on multi-enrollment speaker recognition, which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that popular cosine scoring suffers from poor score calibration with a varying number of enrollment utterances. Second, we propose a simple replacement for cosine scoring based on an extremely constrained version of probabilistic linear discriminant analysis (PLDA). The proposed model improves over cosine scoring for multi-enrollment recognition while keeping the same performance in the case of one-to-one comparisons. Finally, we consider an online speaker clustering task where each step naturally involves multi-enrollment recognition. We propose an online clustering algorithm that lets us benefit from properties of the PLDA model such as the ability to handle uncertainty and better score calibration. Our experiments demonstrate the effectiveness of the proposed algorithm., Comment: Accepted to ICASSP 2023
Published: 2023
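
Note on entry 28: the abstract refers to cosine scoring with several enrollment utterances. A minimal sketch of one common baseline, averaging the enrollment embeddings before taking the cosine similarity, is given below. It is an assumption for illustration only and is not the constrained PLDA model proposed in the paper; the function names and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multi_enroll_cosine(enroll_embeddings, test_embedding):
    """Score a test embedding against the mean of the enrollment embeddings.

    Assumption: enrollment utterances are pooled by simple averaging.
    """
    n = len(enroll_embeddings)
    dim = len(test_embedding)
    mean = [sum(e[i] for e in enroll_embeddings) / n for i in range(dim)]
    return cosine(mean, test_embedding)


# Toy example with made-up embeddings: two enrollment utterances, one test.
score = multi_enroll_cosine([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Because the norm of the averaged embedding changes with the number and spread of enrollment utterances, a score threshold tuned for one-to-one trials may not transfer to multi-enrollment trials, which is the kind of calibration issue the abstract mentions.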

29. I4U System Description for NIST SRE'20 CTS Challenge

Author: Lee, Kong Aik, Kinnunen, Tomi, Colibro, Daniele, Vair, Claudio, Nautsch, Andreas, Sun, Hanwu, He, Liang, Liang, Tianyu, Wang, Qiongqiong, Rouvier, Mickael, Bousquet, Pierre-Michel, Das, Rohan Kumar, Bailo, Ignacio Viñals, Liu, Meng, Delgado, Héctor, Liu, Xuechen, Sahidullah, Md, Cumani, Sandro, Zhang, Boning, Okabe, Koji, Yamamoto, Hitoshi, Tao, Ruijie, Li, Haizhou, Giménez, Alfonso Ortega, Wang, Longbiao, and Buera, Luis
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The I4U submission resulted from active collaboration among researchers across eight research teams - I²R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (China). The submission was based on the fusion of top performing sub-systems and sub-fusion systems contributed by individual teams. Effort was spent on the use of common development and validation sets, on the submission schedule and milestones, and on minimizing inconsistency in trial list and score file formats across sites., Comment: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker Recognition Challenge, 14-12 December 2021
Published: 2022

30. Speaker recognition with two-step multi-modal deep cleansing

Author: Tao, Ruijie, Lee, Kong Aik, Shi, Zhan, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Neural network-based speaker recognition has achieved significant improvement in recent years. A robust speaker representation learns meaningful knowledge from both hard and easy samples in the training set to achieve good performance. However, noisy samples (i.e., with wrong labels) in the training set induce confusion and cause the network to learn the incorrect representation. In this paper, we propose a two-step audio-visual deep cleansing framework to eliminate the effect of noisy labels in speaker representation learning. This framework contains a coarse-grained cleansing step to search for the peculiar samples, followed by a fine-grained cleansing step to filter out the noisy labels. Our study starts from an efficient audio-visual speaker recognition system, which achieves a close to perfect equal-error-rate (EER) of 0.01%, 0.07% and 0.13% on the Vox-O, E and H test sets. With the proposed multi-modal cleansing mechanism, four different speaker recognition networks achieve an average improvement of 5.9%. Code has been made available at: https://github.com/TaoRuijie/AVCleanse., Comment: 5 pages, 3 figures
Published: 2022

31. Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

Author: Tao, Ruijie, Lee, Kong Aik, Das, Rohan Kumar, Hautamäki, Ville, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: We study a novel neural architecture for a speaker encoder, and its training strategies, for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack the diversity necessary for training a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and an EER of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms the state-of-the-art self-supervised learning methods by a large margin and, at the same time, achieves results comparable with the supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets., Comment: 13 pages
Published: 2022

32. Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

Author: Liu, Xiaohui, Liu, Meng, Zhang, Lin, Zhang, Linjuan, Zeng, Chang, Li, Kai, Li, Nan, Lee, Kong Aik, Wang, Longbiao, and Dang, Jianwu
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features. To address track 1, low-quality data augmentation, domain adaptation via finetuning, and various complementary feature information fusion were aggregated in our system. Furthermore, we analyzed the clustering characteristics of subsystems with different features by visualization and explained the effectiveness of our proposed greedy fusion strategy. As for track 2, frame transition and smoothing were detected using a self-supervised learning structure to capture the manipulation of PF attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively., Comment: 7 pages, 1 figure, Accepted by Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia
Published: 2022

33. ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

Author: Liu, Xuechen, Wang, Xin, Sahidullah, Md, Patino, Jose, Delgado, Héctor, Kinnunen, Tomi, Todisco, Massimiliano, Yamagishi, Junichi, Evans, Nicholas, Nautsch, Andreas, and Lee, Kong Aik
Subjects: Computer Science - Sound, Computer Science - Cryptography and Security, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions towards those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This article provides a summary of the ASVspoof 2021 challenge and the results of 54 participating teams that submitted to the evaluation phase. For the logical access (LA) task, results indicate that countermeasures are robust to newly introduced encoding and transmission effects. Results for the physical access (PA) task indicate the potential to detect replay attacks in real, as opposed to simulated physical spaces, but a lack of robustness to variations between simulated and real acoustic environments. The Deepfake (DF) task, new to the 2021 edition, targets solutions to the detection of manipulated, compressed speech data posted online. While detection solutions offer some resilience to compression effects, they lack generalization across different source datasets. In addition to a summary of the top-performing systems for each task, new analyses of influential data factors and results for hidden data subsets, the article includes a review of post-challenge results, an outline of the principal challenge limitations and a road-map for the future of ASVspoof., Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2022

34. The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

Author: Cheng, Gaofeng, Chen, Yifan, Yang, Runyan, Li, Qingxuan, Yang, Zehui, Ye, Lingxuan, Zhang, Pengyuan, Zhang, Qingqing, Xie, Lei, Qian, Yanmin, Lee, Kong Aik, and Yan, Yonghong
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The conversation scenario is one of the most important and most challenging scenarios for speech processing technologies because people in conversation respond to each other in a casual style. Detecting the speech activities of each person in a conversation is vital to downstream tasks, like natural language processing, machine translation, etc. The technology for detecting "who spoke when" is referred to as speaker diarization (SD). Diarization error rate (DER) has long been used as the standard evaluation metric of SD systems. However, DER fails to give enough importance to short conversational phrases, which are short but important on the semantic level. Also, a carefully and accurately manually-annotated testing dataset suitable for evaluating conversational SD technologies is still unavailable in the speech community. In this paper, we design and describe the Conversational Short-phrase Speaker Diarization (CSSD) task, which consists of training and testing datasets, an evaluation metric and baselines. In the dataset aspect, besides the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare an individual 20-hour conversational speech test dataset with carefully and artificially verified speaker timestamp annotations for the CSSD task. In the metric aspect, we design the new conversational DER (CDER) evaluation metric, which calculates the SD accuracy at the utterance level. In the baseline aspect, we adopt a commonly used method, the Variational Bayes HMM x-vector system, as the baseline of the CSSD task. Our evaluation metric is publicly available at https://github.com/SpeechClub/CDER_Metric., Comment: arXiv admin note: text overlap with arXiv:2203.16844
Published: 2022

35. Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion

Author: Shim, Hye-jin, Tak, Hemlata, Liu, Xuechen, Heo, Hee-Soo, Jung, Jee-weon, Chung, Joon Son, Chung, Soo-Whan, Yu, Ha-Jin, Lee, Bong-Jin, Todisco, Massimiliano, Delgado, Héctor, Lee, Kong Aik, Sahidullah, Md, Kinnunen, Tomi, and Evans, Nicholas
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning has brought impressive progress in the study of both automatic speaker verification (ASV) and spoofing countermeasures (CM). Although solutions are mutually dependent, they have typically evolved as standalone sub-systems whereby CM solutions are usually designed for a fixed ASV system. The work reported in this paper aims to gauge the improvements in reliability that can be gained from their closer integration. Results derived using the popular ASVspoof2019 dataset indicate that the equal error rate (EER) of a state-of-the-art ASV system degrades from 1.63% to 23.83% when the evaluation protocol is extended with spoofed trials. However, even the straightforward integration of ASV and CM systems in the form of score-sum and deep neural network-based fusion strategies reduces the EER to 1.71% and 6.37%, respectively. The new Spoofing-Aware Speaker Verification (SASV) challenge has been formed to encourage greater attention to the integration of ASV and CM systems as well as to provide a means to benchmark different solutions., Comment: 8 pages, accepted by Odyssey 2022
Published: 2022
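
Note on entry 35: the abstract describes how the equal error rate (EER) of an ASV system degrades once spoofed trials are added, and how score-sum fusion with a CM system recovers performance. The sketch below illustrates the idea on a handful of made-up scores. It is a simplified assumption and not the challenge baseline: the EER is approximated by a threshold sweep without interpolation, and the trial scores and function name are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep observed scores as thresholds and return the
    mean of the false-rejection and false-acceptance rates at the threshold
    where the two rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy trials (made up): (asv_score, cm_score, is_target).
# Spoofed trials get high ASV scores but low CM scores.
trials = [
    (0.90, 0.90, True),   # bona fide target
    (0.80, 0.80, True),   # bona fide target
    (0.10, 0.90, False),  # bona fide non-target
    (0.20, 0.85, False),  # bona fide non-target
    (0.85, 0.10, False),  # spoofed
    (0.95, 0.05, False),  # spoofed
]

# ASV scores alone: spoofed trials are confused with targets.
asv_only_eer = equal_error_rate(
    [a for a, _, tgt in trials if tgt],
    [a for a, _, tgt in trials if not tgt],
)

# Score-sum fusion: add the CM score to the ASV score for every trial.
fused_eer = equal_error_rate(
    [a + c for a, c, tgt in trials if tgt],
    [a + c for a, c, tgt in trials if not tgt],
)
```

On these toy scores the ASV-only EER is 0.5 and the fused EER is 0.0; the numbers only illustrate the mechanism and say nothing about the results reported in the paper.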

36. Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA?

Author: Wang, Qiongqiong, Lee, Kong Aik, and Liu, Tianchi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The emergence of large-margin softmax cross-entropy losses in training deep speaker embedding neural networks has triggered a gradual shift from parametric back-ends to a simpler cosine similarity measure for speaker verification. Popular parametric back-ends include the probabilistic linear discriminant analysis (PLDA) and its variants. This paper investigates the properties of margin-based cross-entropy losses leading to such a shift and aims to find scoring back-ends best suited for speaker verification. In addition, we revisit the pre-processing techniques which have been widely used in the past and assess their effectiveness on large-margin embeddings. Experiments on the state-of-the-art ECAPA-TDNN networks trained with various large-margin softmax cross-entropy losses show a substantial increment in intra-speaker compactness making the conventional PLDA superfluous. In this regard, we found that constraining the within-speaker covariance matrix could improve the performance of the PLDA. It is demonstrated through a series of experiments on the VoxCeleb-1 and SITW core-core test sets with 40.8% equal error rate (EER) reduction and 35.1% minimum detection cost (minDCF) reduction. It also outperforms cosine scoring consistently with reductions in EER and minDCF by 10.9% and 4.9%, respectively.
Published: 2022

37. Improving Contextual Coherence in Variational Personalized and Empathetic Dialogue Agents

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon Seng
Subjects: Computer Science - Computation and Language
Abstract: In recent years, latent variable models, such as the Conditional Variational Auto Encoder (CVAE), have been applied to both personalized and empathetic dialogue generation. Prior work has largely focused on generating diverse dialogue responses that exhibit persona consistency and empathy. However, when it comes to the contextual coherence of the generated responses, there is still room for improvement. Hence, to improve the contextual coherence, we propose a novel Uncertainty Aware CVAE (UA-CVAE) framework. The UA-CVAE framework involves approximating and incorporating the aleatoric uncertainty during response generation. We apply our framework to both personalized and empathetic dialogue generation. Empirical results show that our framework significantly improves the contextual coherence of the generated response. Additionally, we introduce a novel automatic metric for measuring contextual coherence, which was found to correlate positively with human judgement., Comment: Accepted at ICASSP 2022
Published: 2022

38. Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Author: Yu, Fan, Zhang, Shiliang, Guo, Pengcheng, Fu, Yihui, Du, Zhihao, Zheng, Siqi, Huang, Weilong, Xie, Lei, Tan, Zheng-Hua, Wang, DeLiang, Qian, Yanmin, Lee, Kong Aik, Yan, Zhijie, Ma, Bin, Xu, Xin, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios of speech technologies. The M2MeT challenge set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups and baselines, and summarize the challenge results and major techniques used in the submissions., Comment: Accepted by ICASSP 2022
Published: 2022

39. MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

Author: Liu, Tianchi, Das, Rohan Kumar, Lee, Kong Aik, and Li, Haizhou
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: The time delay neural network (TDNN) is one of the state-of-the-art neural solutions to text-independent speaker verification. However, TDNNs require a large number of filters to capture the speaker characteristics at any local frequency region. In addition, the performance of such systems may degrade under short utterance scenarios. To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN. We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and computation complexity. Further, the MFA mechanism is found to be effective for speaker verification with short test utterances., Comment: Accepted by ICASSP 2022
Published: 2022

40. DLVGen: A Dual Latent Variable Approach to Personalized Dialogue Generation

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon Seng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: The generation of personalized dialogue is vital to natural and human-like conversation. Typically, personalized dialogue generation models involve conditioning the generated response on the dialogue history and a representation of the persona/personality of the interlocutor. As it is impractical to obtain the persona/personality representations for every interlocutor, recent works have explored the possibility of generating personalized dialogue by finetuning the model with dialogue examples corresponding to a given persona instead. However, in real-world implementations, a sufficient number of corresponding dialogue examples are also rarely available. Hence, in this paper, we propose a Dual Latent Variable Generator (DLVGen) capable of generating personalized dialogue in the absence of any persona/personality information or any corresponding dialogue examples. Unlike prior work, DLVGen models the latent distribution over potential responses as well as the latent distribution over the agent's potential persona. During inference, latent variables are sampled from both distributions and fed into the decoder. Empirical results show that DLVGen is capable of generating diverse responses which accurately incorporate the agent's persona., Comment: Accepted at ICAART 2022 as Full Paper
Published: 2021

41. Self-supervised Speaker Recognition with Loss-gated Learning

Author: Tao, Ruijie, Lee, Kong Aik, Das, Rohan Kumar, Hautamäki, Ville, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn't always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to study a loss-gated learning (LGL) strategy, which extracts the reliable labels through the fitting ability of the neural network during training. With the proposed LGL, our speaker recognition model obtains a 46.3% performance gain over the system without it. Further, the proposed self-supervised speaker recognition with LGL trained on the VoxCeleb2 dataset without any labels achieves an equal error rate of 1.66% on the VoxCeleb1 original test set. Code has been made available at: https://github.com/TaoRuijie/Loss-Gated-Learning., Comment: 5 pages, 3 figures
Published: 2021

42. PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction

Author: Ma, Yi, Lee, Kong Aik, Hautamaki, Ville, and Li, Haizhou
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. However, excessive suppression may lead to speech distortion and speaker information loss, which degrades the performance of speaker embedding extraction. To alleviate this problem, we propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction. This framework is optimized based on the feedback of the speaker identification task and the high-level perceptual deviation between the raw speech signal and its noisy version. We conducted speaker verification experiments in both noisy and clean environments to evaluate our system. Compared to the baseline, our method shows better performance in both clean and noisy environments, which means our method can not only enhance the speaker-related information but also avoid adding distortions.
Published: 2021

43. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

Author: Yamagishi, Junichi, Wang, Xin, Todisco, Massimiliano, Sahidullah, Md, Patino, Jose, Nautsch, Andreas, Liu, Xuechen, Lee, Kong Aik, Kinnunen, Tomi, Evans, Nicholas, and Delgado, Héctor
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Computer Science - Sound
Abstract: ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection. This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results. Despite the introduction of channel and compression variability which compound the difficulty, results for the logical access and deepfake tasks are close to those from previous ASVspoof editions. Results for the physical access task show the difficulty in detecting attacks in real, variable physical spaces. With ASVspoof 2021 being the first edition for which participants were not provided with any matched training or development data and with this reflecting real conditions in which the nature of spoofed and deepfake speech can never be predicted with confidence, the results are extremely encouraging and demonstrate the substantial progress made in the field in recent years., Comment: Accepted to the ASVspoof 2021 Workshop
Published: 2021

44. ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

Author: Delgado, Héctor, Evans, Nicholas, Kinnunen, Tomi, Lee, Kong Aik, Liu, Xuechen, Nautsch, Andreas, Patino, Jose, Sahidullah, Md, Todisco, Massimiliano, Wang, Xin, and Yamagishi, Junichi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Computer Science - Sound
Abstract: The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures. ASVspoof 2021 is the 4th in a series of bi-annual, competitive challenges where the goal is to develop countermeasures capable of discriminating between bona fide and spoofed or deepfake speech. This document provides a technical description of the ASVspoof 2021 challenge, including details of training, development and evaluation data, metrics, baselines, evaluation rules, submission procedures and the schedule., Comment: http://www.asvspoof.org
Published: 2021

45. Benchmarking and challenges in security and privacy for voice biometrics

Author: Bonastre, Jean-Francois, Delgado, Hector, Evans, Nicholas, Kinnunen, Tomi, Lee, Kong Aik, Liu, Xuechen, Nautsch, Andreas, Noe, Paul-Gauthier, Patino, Jose, Sahidullah, Md, Srivastava, Brij Mohan Lal, Todisco, Massimiliano, Tomashenko, Natalia, Vincent, Emmanuel, Wang, Xin, and Yamagishi, Junichi
Subjects: Computer Science - Cryptography and Security, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: For many decades, research in speech technologies has focused upon improving reliability. With this now meeting user expectations for a range of diverse applications, speech technology is today omni-present. As a result, a focus on security and privacy has now come to the fore. Here, the research effort is in its relative infancy and progress calls for greater, multidisciplinary collaboration with security, privacy, legal and ethical experts among others. Such collaboration is now underway. To help catalyse the efforts, this paper provides a high-level overview of some related research. It targets the non-speech audience and describes the benchmarking methodology that has spearheaded progress in traditional research and which now drives recent security and privacy initiatives related to voice biometrics. We describe: the ASVspoof challenge relating to the development of spoofing countermeasures; the VoicePrivacy initiative which promotes research in anonymisation for privacy preservation., Comment: Submitted to the symposium of the ISCA Security & Privacy in Speech Communications (SPSC) special interest group
Published: 2021

46. Task-aware Warping Factors in Mask-based Speech Enhancement

Author: Wang, Qiongqiong, Lee, Kong Aik, Koshinaka, Takafumi, Okabe, Koji, and Yamamoto, Hitoshi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes the use of two task-aware warping factors in mask-based speech enhancement (SE). One controls the balance between speech-maintenance and noise-removal in training phases, while the other controls SE power applied to specific downstream tasks in testing phases. Our intention is to alleviate the problem that SE systems trained to improve speech quality often fail to improve other downstream tasks, such as automatic speaker verification (ASV) and automatic speech recognition (ASR), because they do not share the same objectives. It is easy to apply the proposed dual-warping factors approach to any mask-based SE method, and it allows a single SE system to handle multiple tasks without task-dependent training. The effectiveness of our proposed approach has been confirmed on the SITW dataset for ASV evaluation and the LibriSpeech dataset for ASR and speech quality evaluations of 0-20dB. We show that different warping values are necessary for a single SE to achieve optimal performance w.r.t. the three tasks. With the use of task-dependent warping factors, speech quality was improved by an 84.7% PESQ increase, ASV had a 22.4% EER reduction, and ASR had a 52.2% WER reduction, on 0dB speech. The effectiveness of the task-dependent warping factors was also cross-validated on VoxCeleb-1 test set for ASV and LibriSpeech dev-clean set for ASV and quality evaluations. The proposed method is highly effective and easy to apply in practice., Comment: EUSIPCO 2021 (the 29th European Signal Processing Conference)
Published: 2021

47. Xi-Vector Embedding for Speaker Recognition

Author: Lee, Kong Aik, Wang, Qiongqiong, and Koshinaka, Takafumi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We present a Bayesian formulation for deep speaker embedding, wherein the xi-vector is the Bayesian counterpart of the x-vector, taking into account the uncertainty estimate. On the technology front, we offer a simple and straightforward extension to the now widely used x-vector. It consists of an auxiliary neural net predicting the frame-wise uncertainty of the input sequence. We show that the proposed extension leads to substantial improvement across all operating points, with a significant reduction in error rates and detection cost. On the theoretical front, our proposal integrates the Bayesian formulation of linear Gaussian model to speaker-embedding neural networks via the pooling layer. In one sense, our proposal integrates the Bayesian formulation of the i-vector to that of the x-vector. Hence, we refer to the embedding as the xi-vector, which is pronounced as /zai/ vector. Experimental results on the SITW evaluation set show a consistent improvement of over 17.5% in equal-error-rate and 10.9% in minimum detection cost.
Published: 2021

48. Generating Personalized Dialogue via Multi-Task Meta-Learning

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon Seng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Conventional approaches to personalized dialogue generation typically require a large corpus, as well as predefined persona information. However, in a real-world setting, neither a large corpus of training data nor persona information is readily available. To address these practical limitations, we propose a novel multi-task meta-learning approach which involves training a model to adapt to new personas without relying on a large corpus, or on any predefined persona information. Instead, the model is tasked with generating personalized responses based on only the dialogue context. Unlike prior work, our approach leverages the provided persona information only during training via the introduction of an auxiliary persona reconstruction task. In this paper, we introduce two frameworks that adopt the proposed multi-task meta-learning approach: the Multi-Task Meta-Learning (MTML) framework, and the Alternating Multi-Task Meta-Learning (AMTML) framework. Experimental results show that utilizing MTML and AMTML results in dialogue responses with greater persona consistency., Comment: Accepted at SemDial 2021 (PotsDial 2021)
Published: 2021

49. Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

Author: Zhu, Hongning, Lee, Kong Aik, and Li, Haizhou
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes a serialized multi-layer multi-head attention for neural speaker embedding in text-independent speaker verification. In prior works, frame-level features from one layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules to create fixed-dimensional representations of speakers. Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance with the statistics pooling. With more layers stacked, the neural network can learn more discriminative speaker embeddings. Experimental results on the VoxCeleb1 and SITW datasets show that our proposed method outperforms other baseline methods, including x-vectors and other x-vectors + conventional attentive pooling approaches by 9.7% in EER and 8.1% in DCF0.01., Comment: Accepted by Interspeech 2021
Published: 2021

50. Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification

Author: Zhang, Li, Wang, Qing, Lee, Kong Aik, Xie, Lei, and Li, Haizhou
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In far-field speaker verification, the performance of speaker embeddings is susceptible to degradation when there is a mismatch between the conditions of enrollment and test speech. To solve this problem, we propose the feature-level and instance-level transfer learning in the teacher-student framework to learn a domain-invariant embedding space. For the feature-level knowledge transfer, we develop the contrastive loss to transfer knowledge from teacher model to student model, which can not only decrease the intra-class distance, but also enlarge the inter-class distance. Moreover, we propose the instance-level pairwise distance transfer method to force the student model to preserve pairwise instance distances from the well-optimized embedding space of the teacher model. On the FFSVC 2020 evaluation set, our EER on Full-eval trials is relatively reduced by 13.9% compared with the fusion system result on Partial-eval trials of Task2. On Task1, compared with the winner's DenseNet result on Partial-eval trials, our minDCF on Full-eval trials is relatively reduced by 6.3%. On Task3, the EER and minDCF of our proposed method on Full-eval trials are very close to the result of the fusion system on Partial-eval trials. Our results also outperform other competitive domain adaptation methods.
Published: 2021

412 results on '"Lee, Kong Aik"'
