Author: "Lee, Kong Aik" - Searchworks@Jio Institute Digital Library Search Results

413 results on '"Lee, Kong Aik"'

51. Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification

Author: Zhang, Li, Wang, Qing, Lee, Kong Aik, Xie, Lei, and Li, Haizhou
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In far-field speaker verification, the performance of speaker embeddings is susceptible to degradation when there is a mismatch between the conditions of enrollment and test speech. To solve this problem, we propose the feature-level and instance-level transfer learning in the teacher-student framework to learn a domain-invariant embedding space. For the feature-level knowledge transfer, we develop the contrastive loss to transfer knowledge from teacher model to student model, which can not only decrease the intra-class distance, but also enlarge the inter-class distance. Moreover, we propose the instance-level pairwise distance transfer method to force the student model to preserve pairwise instances distance from the well optimized embedding space of the teacher model. On FFSVC 2020 evaluation set, our EER on Full-eval trials is relatively reduced by 13.9% compared with the fusion system result on Partial-eval trials of Task2. On Task1, compared with the winner's DenseNet result on Partial-eval trials, our minDCF on Full-eval trials is relatively reduced by 6.3%. On Task3, the EER and minDCF of our proposed method on Full-eval trials are very close to the result of the fusion system on Partial-eval trials. Our results also outperform other competitive domain adaptation methods.
Published: 2021

52. Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing

Author: Kinnunen, Tomi, Nautsch, Andreas, Sahidullah, Md, Evans, Nicholas, Wang, Xin, Todisco, Massimiliano, Delgado, Héctor, Yamagishi, Junichi, and Lee, Kong Aik
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Applications
Abstract: Whether it be for results summarization, or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity. We propose a simple method to derive a 2D representation from detection scores produced by an arbitrary set of binary classifiers in response to a common dataset. Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores and with close relation to receiver operating characteristic (ROC) and detection error trade-off (DET) analyses. While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems. The former are produced by a Gaussian mixture model system trained with VoxCeleb data whereas the latter stem from submissions to the ASVspoof 2019 challenge.
Comment: Accepted to Interspeech 2021. Example code available at https://github.com/asvspoof-challenge/classifier-adjacency
Published: 2021

53. Exploring Deep Learning for Joint Audio-Visual Lip Biometrics

Author: Liu, Meng, Wang, Longbiao, Lee, Kong Aik, Zhang, Hanyi, Zeng, Chang, and Dang, Jianwu
Subjects: Computer Science - Multimedia, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a moderate-size database using existing public databases. Meanwhile, we establish the DeepLip AV lip biometrics system realized with a convolutional neural network (CNN) based video module, a time-delay neural network (TDNN) based audio module, and a multimodal fusion module. Our experiments show that DeepLip outperforms traditional speaker recognition models in context modeling and achieves over 50% relative improvements compared with our best single modality baseline, with an equal error rate of 0.75% and 1.11% on the test datasets, respectively.
Published: 2021

54. ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech

Author: Nautsch, Andreas, Wang, Xin, Evans, Nicholas, Kinnunen, Tomi, Vestman, Ville, Todisco, Massimiliano, Delgado, Héctor, Sahidullah, Md, Yamagishi, Junichi, and Lee, Kong Aik
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound
Abstract: The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV). This paper describes the third in a series of bi-annual challenges: ASVspoof 2019. With the challenge database and protocols being described elsewhere, the focus of this paper is on results and the top performing single and ensemble system submissions from 62 teams, all of which out-perform the two baseline systems, often by a substantial margin. Deeper analyses show that performance is dominated by specific conditions involving either specific spoofing attacks or specific acoustic environments. While fusion is shown to be particularly effective for the logical access scenario involving speech synthesis and voice conversion attacks, participants largely struggled to apply fusion successfully for the physical access scenario involving simulated replay attacks. This is likely the result of a lack of system complementarity, while oracle fusion experiments show clear potential to improve performance. Furthermore, while results for simulated data are promising, experiments with real replay data show a substantial gap, most likely due to the presence of additive noise in the latter. This finding, among others, leads to a number of ideas for further research and directions for future editions of the ASVspoof challenge.
Published: 2021

55. Callsafe – the Vishing Barrier

Author: Ang, Swee Boon, Song, Samuel Yu Hao, Tan, Jing Yuan, Toh, Cheng Kiat Brendan, Tan, Jin Hao, Guo, Huaqun, Lee, Kong Aik, Yar, Kar Peo, Lu, Jiqiang, editor, Guo, Huaqun, editor, McLoughlin, Ian, editor, Chekole, Eyasu Getahun, editor, Lakshmanan, Umayal, editor, Meng, Weizhi, editor, Wang, Peng Cheng, editor, and Heng Loong Wong, Nicholas, editor
Published: 2023

56. Introduction to Voice Presentation Attack Detection and Recent Advances

Author: Sahidullah, Md, Delgado, Héctor, Todisco, Massimiliano, Nautsch, Andreas, Wang, Xin, Kinnunen, Tomi, Evans, Nicholas, Yamagishi, Junichi, Lee, Kong-Aik, Singh, Sameer, Founding Editor, Kang, Sing Bing, Series Editor, Bischof, Horst, Advisory Editor, Bowden, Richard, Advisory Editor, Dickinson, Sven, Advisory Editor, Jia, Jiaya, Advisory Editor, Lee, Kyoung Mu, Advisory Editor, Lin, Zhouchen, Advisory Editor, Sato, Yoichi, Advisory Editor, Schiele, Bernt, Advisory Editor, Sclaroff, Stan, Advisory Editor, Marcel, Sébastien, editor, Fierrez, Julian, editor, and Evans, Nicholas, editor
Published: 2023

57. Using Multi-Resolution Feature Maps with Convolutional Neural Networks for Anti-Spoofing in ASV

Author: Wang, Qiongqiong, Lee, Kong Aik, and Koshinaka, Takafumi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper presents a simple but effective method that uses multi-resolution feature maps with convolutional neural networks (CNNs) for anti-spoofing in automatic speaker verification (ASV). The central idea is to alleviate the problem that the feature maps commonly used in anti-spoofing networks are insufficient for building discriminative representations of audio segments, as they are often extracted by a single-length sliding window. Resulting trade-offs between time and frequency resolutions restrict the information in single spectrograms. The proposed method improves both frequency resolution and time resolution by stacking multiple spectrograms that are extracted using different window lengths. These are fed into a convolutional neural network in the form of multiple channels, making it possible to extract more information from input signals while only marginally increasing computational costs. The efficiency of the proposed method has been confirmed on the ASVspoof 2019 database. We show that the use of the proposed multi-resolution inputs consistently outperforms that of score fusion across different CNN architectures. Moreover, computational cost remains small.
Comment: Odyssey 2020 (The Speaker and Language Recognition Workshop)
Published: 2020

58. A Generalized Framework for Domain Adaptation of PLDA in Speaker Recognition

Author: Wang, Qiongqiong, Okabe, Koji, Lee, Kong Aik, and Koshinaka, Takafumi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper proposes a generalized framework for domain adaptation of Probabilistic Linear Discriminant Analysis (PLDA) in speaker recognition. It not only includes several existing supervised and unsupervised domain adaptation methods but also makes possible more flexible usage of available data in different domains. In particular, we introduce two new techniques: (1) correlation-alignment-based interpolation and (2) covariance regularization. The proposed correlation-alignment-based interpolation method decreases minCprimary by up to 30.5% as compared with that from an out-of-domain PLDA model before adaptation, and minCprimary is also 5.5% lower than with a conventional linear interpolation method with optimal interpolation weights. Further, the proposed regularization technique ensures robustness in interpolations w.r.t. varying interpolation weights, which in practice is essential.
Comment: ICASSP 2020 (45th International Conference on Acoustics, Speech, and Signal Processing)
Published: 2020

59. Extrapolating false alarm rates in automatic speaker verification

Author: Sholokhov, Alexey, Kinnunen, Tomi, Vestman, Ville, and Lee, Kong Aik
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Automatic speaker verification (ASV) vendors and corpus providers would both benefit from tools to reliably extrapolate performance metrics for large speaker populations without collecting new speakers. We address false alarm rate extrapolation under a worst-case model whereby an adversary identifies the closest impostor for a given target speaker from a large population. Our models are generative and allow sampling new speakers. The models are formulated in the ASV detection score space to facilitate analysis of arbitrary ASV systems.
Comment: Accepted for publication to Interspeech 2020
Published: 2020

60. Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals

Author: Kinnunen, Tomi, Delgado, Héctor, Evans, Nicholas, Lee, Kong Aik, Vestman, Ville, Nautsch, Andreas, Todisco, Massimiliano, Wang, Xin, Sahidullah, Md, Yamagishi, Junichi, and Reynolds, Douglas A.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs. The reliability of spoofing CMs is typically gauged using the equal error rate (EER) metric. The primitive EER fails to reflect application requirements and the impact of spoofing and CMs upon ASV, and its use as a primary metric in traditional ASV research has long been abandoned in favour of risk-based approaches to assessment. This paper presents several new extensions to the tandem detection cost function (t-DCF), a recent risk-based approach to assess the reliability of spoofing CMs deployed in tandem with an ASV system. Extensions include a simplified version of the t-DCF with fewer parameters, an analysis of a special case for a fixed ASV system, simulations which give original insights into its interpretation and new analyses using the ASVspoof 2019 database. It is hoped that adoption of the t-DCF for the CM assessment will help to foster closer collaboration between the anti-spoofing and ASV research communities.
Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)
Published: 2020

61. Neural i-vectors

Author: Vestman, Ville, Lee, Kong Aik, and Kinnunen, Tomi H.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Deep speaker embeddings have been demonstrated to outperform their generative counterparts, i-vectors, in recent speaker verification evaluations. To combine the benefits of high performance and generative interpretation, we investigate the use of a deep embedding extractor and an i-vector extractor in succession. To bundle the deep embedding extractor with an i-vector extractor, we adopt aggregation layers inspired by the Gaussian mixture model (GMM) to the embedding extractor networks. The inclusion of a GMM-like layer allows the discriminatively trained network to be used as a provider of sufficient statistics for the i-vector extractor to extract what we call neural i-vectors. We compare the deep embeddings to the proposed neural i-vectors on the Speakers in the Wild (SITW) and the Speaker Recognition Evaluation (SRE) 2018 and 2019 datasets. On the core-core condition of SITW, our deep embeddings obtain performance comparable to the state-of-the-art. The neural i-vectors obtain about 50% worse performance than the deep embeddings, but on the other hand outperform the previous i-vector approaches reported in the literature by a clear margin.
Comment: Accepted to Odyssey 2020: The Speaker and Language Recognition Workshop. Version 2 (bugfix)
Published: 2020

62. Short-duration Speaker Verification (SdSV) Challenge 2021: the Challenge Evaluation Plan

Author: Zeinali, Hossein, Lee, Kong Aik, Alam, Jahangir, and Burget, Lukas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: This document describes the Short-duration Speaker Verification (SdSV) Challenge 2021. The main goal of the challenge is to evaluate new technologies for text-dependent (TD) and text-independent (TI) speaker verification (SV) in a short duration scenario. The proposed challenge evaluates SdSV with varying degree of phonetic overlap between the enrollment and test utterances (cross-lingual). It is the first challenge with a broad focus on systematic benchmark and analysis on varying degrees of phonetic variability on short-duration speaker recognition. We expect that modern methods (deep neural networks in particular) will play a key role.
Published: 2019

63. Speaker detection in the wild: Lessons learned from JSALT 2019

Author: Garcia, Paola, Villalba, Jesus, Bredin, Herve, Du, Jun, Castan, Diego, Cristia, Alejandrina, Bullock, Latane, Guo, Ling, Okabe, Koji, Nidadavolu, Phani Sankar, Kataria, Saurabh, Chen, Sizhu, Galmant, Leo, Lavechin, Marvin, Sun, Lei, Gill, Marie-Philippe, Ben-Yair, Bar, Abdoli, Sajjad, Wang, Xin, Bouaziz, Wassim, Titeux, Hadrien, Dupoux, Emmanuel, Lee, Kong Aik, and Dehak, Najim
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker detection; but our first finding was that an effective diarization improves detection, and not having a diarization stage impoverishes the performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering; and the overall impact of previous stages in the final speaker detection. In this paper, we show partial results for speaker diarization to have a better understanding of the problem and we present the final results for speaker detection.
Comment: Submitted to ICASSP 2020
Published: 2019

64. Voice Biometrics Security: Extrapolating False Alarm Rate via Hierarchical Bayesian Modeling of Speaker Verification Scores

Author: Sholokhov, Alexey, Kinnunen, Tomi, Vestman, Ville, and Lee, Kong Aik
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Statistics - Machine Learning
Abstract: How secure is automatic speaker verification (ASV) technology? More concretely, given a specific target speaker, how likely is it to find another person who gets falsely accepted as that target? This question may be addressed empirically by studying naturally confusable pairs of speakers within a large enough corpus. To this end, one might expect to find at least some speaker pairs that are indistinguishable from each other in terms of ASV. To a certain extent, such an aim is mirrored in the standardized ASV evaluation benchmarks. However, the number of speakers in such evaluation benchmarks represents only a small fraction of all possible human voices, making it challenging to extrapolate performance beyond a given corpus. Furthermore, the impostors used in performance evaluation are usually selected randomly. A potentially more meaningful definition of an impostor - at least in the context of security-driven ASV applications - would be the closest (most confusable) other speaker to a given target. We put forward a novel performance assessment framework to address both the inadequacy of the random-impostor evaluation model and the size limitation of evaluation corpora by addressing ASV security against closest impostors on arbitrarily large datasets. The framework allows one to make a prediction of the safety of given ASV technology, in its current state, for arbitrarily large speaker database size consisting of virtual (sampled) speakers. As a proof-of-concept, we analyze the performance of two state-of-the-art ASV systems, based on i-vector and x-vector speaker embeddings (as implemented in the popular Kaldi toolkit), on the recent VoxCeleb 1 & 2 corpora. We found that neither the i-vector nor the x-vector system is immune to increased false alarm rate at increased impostor database size.
Comment: Accepted to be published in Computer Speech and Language
Published: 2019

65. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

Author: Wang, Xin, Yamagishi, Junichi, Todisco, Massimiliano, Delgado, Hector, Nautsch, Andreas, Evans, Nicholas, Sahidullah, Md, Vestman, Ville, Kinnunen, Tomi, Lee, Kong Aik, Juvela, Lauri, Alku, Paavo, Peng, Yu-Huai, Hwang, Hsin-Te, Tsao, Yu, Wang, Hsin-Min, Maguer, Sebastien Le, Becker, Markus, Henderson, Fergus, Clark, Rob, Zhang, Yu, Wang, Quan, Jia, Ye, Onuma, Kai, Mushika, Koji, Kaneda, Takashi, Jiang, Yuan, Liu, Li-Juan, Wu, Yi-Chiao, Huang, Wen-Chin, Toda, Tomoki, Tanaka, Kou, Kameoka, Hirokazu, Steiner, Ingmar, Matrouf, Driss, Bonastre, Jean-Francois, Govender, Avashna, Ronanki, Srikanth, Zhang, Jing-Xuan, and Ling, Zhen-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects.
Comment: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114
Published: 2019

66. Unleashing the Unused Potential of I-Vectors Enabled by GPU Acceleration

Author: Vestman, Ville, Lee, Kong Aik, Kinnunen, Tomi H., and Koshinaka, Takafumi
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: Speaker embeddings are continuous-value vector representations that allow easy comparison between voices of speakers with simple geometric operations. Among others, i-vector and x-vector have emerged as the mainstream methods for speaker embedding. In this paper, we illustrate the use of modern computation platform to harness the benefit of GPU acceleration for i-vector extraction. In particular, we achieve an acceleration of 3000 times in frame posterior computation compared to real time and 25 times in training the i-vector extractor compared to the CPU baseline from Kaldi toolkit. This significant speed-up allows the exploration of ideas that were hitherto impossible. In particular, we show that it is beneficial to update the universal background model (UBM) and re-compute frame alignments while training the i-vector extractor. Additionally, we are able to study different variations of i-vector extractors more rigorously than before. In this process, we reveal some undocumented details of Kaldi's i-vector extractor and show that it outperforms the standard formulation by a margin of 1 to 2% when tested with VoxCeleb speaker verification protocol. All of our findings are asserted by ensemble averaging the results from multiple runs with random start.
Comment: Accepted to Interspeech 2019
Published: 2019

67. I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

Author: Lee, Kong Aik, Hautamaki, Ville, Kinnunen, Tomi, Yamamoto, Hitoshi, Okabe, Koji, Vestman, Ville, Huang, Jing, Ding, Guohong, Sun, Hanwu, Larcher, Anthony, Das, Rohan Kumar, Li, Haizhou, Rouvier, Mickael, Bousquet, Pierre-Michel, Rao, Wei, Wang, Qing, Zhang, Chunlei, Bahmaninezhad, Fahimeh, Delgado, Hector, Patino, Jose, Wang, Qiongqiong, Guo, Ling, Koshinaka, Takafumi, Zhang, Jiacen, Shinoda, Koichi, Trong, Trung Ngo, Sahidullah, Md, Lu, Fan, Tang, Yun, Tu, Ming, Teh, Kah Kuan, Tran, Huy Dat, George, Kuruvachan K., Kukanov, Ivan, Desnous, Florent, Yang, Jichen, Yilmaz, Emre, Xu, Longting, Bonastre, Jean-Francois, Xu, Chenglin, Lim, Zhi Hao, Chng, Eng Siong, Ranjan, Shivesh, Hansen, John H. L., Todisco, Massimiliano, and Evans, Nicholas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of the I4U consortium's participation in the NIST SRE series of evaluations. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progress, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation.
Comment: 5 pages
Published: 2019

68. ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection

Author: Todisco, Massimiliano, Wang, Xin, Vestman, Ville, Sahidullah, Md, Delgado, Hector, Nautsch, Andreas, Yamagishi, Junichi, Evans, Nicholas, Kinnunen, Tomi, and Lee, Kong Aik
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound
Abstract: ASVspoof, now in its third edition, is a series of community-led challenges which promote the development of countermeasures to protect automatic speaker verification (ASV) from the threat of spoofing. Advances in the 2019 edition include: (i) a consideration of both logical access (LA) and physical access (PA) scenarios and the three major forms of spoofing attack, namely synthetic, converted and replayed speech; (ii) spoofing attacks generated with state-of-the-art neural acoustic and waveform models; (iii) an improved, controlled simulation of replay attacks; (iv) use of the tandem detection cost function (t-DCF) that reflects the impact of both spoofing and countermeasures upon ASV reliability. Even if ASV remains the core focus, in retaining the equal error rate (EER) as a secondary metric, ASVspoof also embraces the growing importance of fake audio detection. ASVspoof 2019 attracted the participation of 63 research teams, with more than half of these reporting systems that improve upon the performance of two baseline spoofing countermeasures. This paper describes the 2019 database, protocols and challenge results. It also outlines major findings which demonstrate the real progress made in protecting against the threat of spoofing and fake audio.
Published: 2019

69. Introduction to Voice Presentation Attack Detection and Recent Advances

Author: Sahidullah, Md, Delgado, Hector, Todisco, Massimiliano, Kinnunen, Tomi, Evans, Nicholas, Yamagishi, Junichi, and Lee, Kong-Aik
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Over the past few years significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker recognition (ASV). This includes the development of new speech corpora, standard evaluation protocols and advancements in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has enabled for the first time the meaningful benchmarking of different PAD solutions. This chapter summarises the progress, with a focus on studies completed in the last three years. The article presents a summary of findings and lessons learned from two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is required to develop generalised PAD solutions which have potential to detect diverse and previously unseen spoofing attacks.
Comment: Published as a book-chapter in Handbook of Biometric Anti-Spoofing Presentation Attack Detection (Second Edition)
Published: 2019

70. The CORAL+ Algorithm for Unsupervised Domain Adaptation of PLDA

Author: Lee, Kong Aik, Wang, Qiongqiong, and Koshinaka, Takafumi
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domain (e.g., language, demographic) in which the system is deployed differs from the one in which it was trained. To close the gap due to the domain mismatch, we propose an unsupervised PLDA adaptation algorithm to learn from a small amount of unlabeled in-domain data. The proposed method was inspired by a prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE'16, SRE'18) datasets.
Comment: 5 pages
Published: 2018

71. Introduction to Voice Presentation Attack Detection and Recent Advances

Author: Sahidullah, Md, primary, Delgado, Héctor, additional, Todisco, Massimiliano, additional, Nautsch, Andreas, additional, Wang, Xin, additional, Kinnunen, Tomi, additional, Evans, Nicholas, additional, Yamagishi, Junichi, additional, and Lee, Kong-Aik, additional
Published: 2023

72. Noise-Robust Semi-supervised Multi-modal Machine Translation

Author: Li, Lin, Hu, Kaixi, Tayir, Turghun, Liu, Jianquan, Lee, Kong Aik, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Khanna, Sankalp, editor, Cao, Jian, editor, Bai, Quan, editor, and Xu, Guandong, editor
Published: 2022

73. A Dual Latent Variable Personalized Dialogue Agent

Author: Lee, Jing Yang, Lee, Kong Aik, and Gan, Woon Seng
Published: 2023

74. Attention Mechanism in Speaker Recognition: What Does It Learn in Deep Speaker Embedding?

Author: Wang, Qiongqiong, Okabe, Koji, Lee, Kong Aik, Yamamoto, Hitoshi, and Koshinaka, Takafumi
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents an experimental study on deep speaker embedding with an attention mechanism that has been found to be a powerful representation learning technique in speaker recognition. In this framework, an attention model works as a frame selector that computes an attention weight for each frame-level feature vector, in accord with which an utterance-level representation is produced at the pooling layer in a speaker embedding network. In general, an attention model is trained together with the speaker embedding network on a single objective function, and thus those two components are tightly bound to one another. In this paper, we consider the possibility that the attention model might be decoupled from its parent network and assist other speaker embedding networks and even conventional i-vector extractors. This possibility is demonstrated through a series of experiments on a NIST Speaker Recognition Evaluation (SRE) task, with 9.0% EER reduction and 3.8% min_Cprimary reduction when the attention weights are applied to i-vector extraction. Another experiment shows that DNN-based soft voice activity detection (VAD) can be effectively combined with the attention mechanism to yield further reduction of min_Cprimary by 6.6% and 1.6% in deep speaker embedding and i-vector systems, respectively.
Comment: SLT 2018 (Workshop on Spoken Language Technology)
Published: 2018

75. t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification

Author: Kinnunen, Tomi, Lee, Kong Aik, Delgado, Hector, Evans, Nicholas, Todisco, Massimiliano, Sahidullah, Md, Yamagishi, Junichi, and Reynolds, Douglas A.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound, Statistics - Machine Learning
Abstract: The ASVspoof challenge series was born to spearhead research in anti-spoofing for automatic speaker verification (ASV). The two challenge editions in 2015 and 2017 involved the assessment of spoofing countermeasures (CMs) in isolation from ASV using an equal error rate (EER) metric. While a strategic approach to assessment at the time, it has certain shortcomings. First, the CM EER is not necessarily a reliable predictor of performance when ASV and CMs are combined. Second, the EER operating point is ill-suited to user authentication applications, e.g. telephone banking, characterised by a high target user prior but a low spoofing attack prior. We aim to migrate from CM- to ASV-centric assessment with the aid of a new tandem detection cost function (t-DCF) metric. It extends the conventional DCF used in ASV research to scenarios involving spoofing attacks. The t-DCF metric has 6 parameters: (i) false alarm and miss costs for both systems, and (ii) prior probabilities of target and spoof trials (with an implied third, nontarget prior). The study is intended to serve as a self-contained, tutorial-like presentation. We analyse with the t-DCF a selection of top-performing CM submissions to the 2015 and 2017 editions of ASVspoof, with a focus on the spoofing attack prior. Whereas there is little to choose between countermeasure systems for lower priors, system rankings derived with the EER and t-DCF show differences for higher priors. We observe some ranking changes. Findings support the adoption of the DCF-based metric into the roadmap for future ASVspoof challenges, and possibly for other biometric anti-spoofing evaluations.
Comment: Published in Odyssey 2018: the Speaker and Language Recognition Workshop [cleaned up source files]
Published: 2018

76. CPAUG: Refining Copy-Paste Augmentation for Speech Anti-Spoofing

Author: Zhang, Linjuan, primary, Lee, Kong Aik, additional, Zhang, Lin, additional, Wang, Longbiao, additional, and Niu, Baoning, additional
Published: 2024
Full Text: View/download PDF

77. Modeling Pseudo-Speaker Uncertainty in Voice Anonymization

Author: Chen, Liping, primary, Lee, Kong Aik, additional, Guo, Wu, additional, and Ling, Zhen-Hua, additional
Published: 2024
Full Text: View/download PDF

78. Gradient Weighting for Speaker Verification in Extremely Low Signal-to-Noise Ratio

Author: Ma, Yi, primary, Lee, Kong Aik, additional, Hautamäki, Ville, additional, Ge, Meng, additional, and Li, Haizhou, additional
Published: 2024
Full Text: View/download PDF

79. Deep Discriminative Embedding with Ranked Weight for Speaker Verification

Author: Zhou, Dao, Wang, Longbiao, Lee, Kong Aik, Liu, Meng, Dang, Jianwu, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Yang, Haiqin, editor, Pasupa, Kitsuchart, editor, Leung, Andrew Chi-Sing, editor, Kwok, James T., editor, Chan, Jonathan H., editor, and King, Irwin, editor
Published: 2020
Full Text: View/download PDF

80. Fantastic 4 system for NIST 2015 Language Recognition Evaluation

Author: Lee, Kong Aik, Hautamäki, Ville, Larcher, Anthony, Rao, Wei, Sun, Hanwu, Nguyen, Trung Hieu, Wang, Guangsen, Sizov, Aleksandr, Kukanov, Ivan, Poorjam, Amir, Trong, Trung Ngo, Xiao, Xiong, Xu, Cheng-Lin, Xu, Hai-Hua, Ma, Bin, Li, Haizhou, and Meignier, Sylvain
Subjects: Computer Science - Computation and Language
Abstract: This article describes the systems jointly submitted by the Institute for Infocomm Research (I$^2$R), the Laboratoire d'Informatique de l'Université du Maine (LIUM), Nanyang Technological University (NTU) and the University of Eastern Finland (UEF) for the 2015 NIST Language Recognition Evaluation (LRE). The submitted system is a fusion of nine sub-systems based on i-vectors extracted from different types of features. Given the i-vectors, several classifiers are adopted for the language detection task, including support vector machines (SVM), multi-class logistic regression (MCLR), probabilistic linear discriminant analysis (PLDA) and deep neural networks (DNN).
Comment: Technical report for NIST LRE 2015 Workshop
Published: 2016

81. Replay attack detection using variable-frequency resolution phase and magnitude features

Author: Liu, Meng, Wang, Longbiao, Dang, Jianwu, Lee, Kong Aik, and Nakagawa, Seiichi
Published: 2021
Full Text: View/download PDF

82. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

Author: Wang, Xin, Yamagishi, Junichi, Todisco, Massimiliano, Delgado, Héctor, Nautsch, Andreas, Evans, Nicholas, Sahidullah, Md, Vestman, Ville, Kinnunen, Tomi, Lee, Kong Aik, Juvela, Lauri, Alku, Paavo, Peng, Yu-Huai, Hwang, Hsin-Te, Tsao, Yu, Wang, Hsin-Min, Maguer, Sébastien Le, Becker, Markus, Henderson, Fergus, Clark, Rob, Zhang, Yu, Wang, Quan, Jia, Ye, Onuma, Kai, Mushika, Koji, Kaneda, Takashi, Jiang, Yuan, Liu, Li-Juan, Wu, Yi-Chiao, Huang, Wen-Chin, Toda, Tomoki, Tanaka, Kou, Kameoka, Hirokazu, Steiner, Ingmar, Matrouf, Driss, Bonastre, Jean-François, Govender, Avashna, Ronanki, Srikanth, Zhang, Jing-Xuan, and Ling, Zhen-Hua
Published: 2020
Full Text: View/download PDF

83. NEC-TT System for Mixed-Bandwidth and Multi-Domain Speaker Recognition

Author: Lee, Kong Aik, Yamamoto, Hitoshi, Okabe, Koji, Wang, Qiongqiong, Guo, Ling, Koshinaka, Takafumi, Zhang, Jiacen, and Shinoda, Koichi
Published: 2020
Full Text: View/download PDF

84. Voice biometrics security: Extrapolating false alarm rate via hierarchical Bayesian modeling of speaker verification scores

Author: Sholokhov, Alexey, Kinnunen, Tomi, Vestman, Ville, and Lee, Kong Aik
Published: 2020
Full Text: View/download PDF

85. Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Author: Liu, Tianchi, primary, Lee, Kong Aik, additional, Wang, Qiongqiong, additional, and Li, Haizhou, additional
Published: 2024
Full Text: View/download PDF

86. Cosine Scoring With Uncertainty for Neural Speaker Embedding

Author: Wang, Qiongqiong, primary and Lee, Kong Aik, additional
Published: 2024
Full Text: View/download PDF

87. Encoder-Decoder Calibration for Multimodal Machine Translation

Author: Tayir, Turghun, primary, Li, Lin, additional, Li, Bei, additional, Liu, Jianquan, additional, and Lee, Kong Aik, additional
Published: 2024
Full Text: View/download PDF

88. The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR

Author: Liang, Yuhao, primary, Shi, Mohan, additional, Yu, Fan, additional, Li, Yangze, additional, Zhang, Shiliang, additional, Du, Zhihao, additional, Chen, Qian, additional, Xie, Lei, additional, Qian, Yanmin, additional, Wu, Jian, additional, Chen, Zhuo, additional, Lee, Kong Aik, additional, Yan, Zhijie, additional, and Bu, Hui, additional
Published: 2023
Full Text: View/download PDF

89. Towards Single Integrated Spoofing-aware Speaker Verification Embeddings

Author: Mun, Sung Hwan, primary, Shim, Hye-jin, additional, Tak, Hemlata, additional, Wang, Xin, additional, Liu, Xuechen, additional, Sahidullah, Md, additional, Jeong, Myeonghun, additional, Han, Min Hyun, additional, Todisco, Massimiliano, additional, Lee, Kong Aik, additional, Yamagishi, Junichi, additional, Evans, Nicholas, additional, Kinnunen, Tomi, additional, Kim, Nam Soo, additional, and Jung, Jee-weon, additional
Published: 2023
Full Text: View/download PDF

90. Speaker-Aware Anti-spoofing

Author: Liu, Xuechen, primary, Sahidullah, Md, additional, Lee, Kong Aik, additional, and Kinnunen, Tomi, additional
Published: 2023
Full Text: View/download PDF

91. Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection

Author: Liu, Xiaohui, primary, Liu, Meng, additional, Wang, Longbiao, additional, Lee, Kong Aik, additional, Zhang, Hanyi, additional, and Dang, Jianwu, additional
Published: 2023
Full Text: View/download PDF

92. Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning

Author: Chen, Hui, primary, Zhang, Hanyi, additional, Wang, Longbiao, additional, Lee, Kong Aik, additional, Liu, Meng, additional, and Dang, Jianwu, additional
Published: 2023
Full Text: View/download PDF

93. Noise-Disentanglement Metric Learning for Robust Speaker Verification

Author: Sun, Yao, primary, Zhang, Hanyi, additional, Wang, Longbiao, additional, Lee, Kong Aik, additional, Liu, Meng, additional, and Dang, Jianwu, additional
Published: 2023
Full Text: View/download PDF

94. Speaker Recognition with Two-Step Multi-Modal Deep Cleansing

Author: Tao, Ruijie, primary, Lee, Kong Aik, additional, Shi, Zhan, additional, and Li, Haizhou, additional
Published: 2023
Full Text: View/download PDF

95. Probabilistic Back-ends for Online Speaker Recognition and Clustering

Author: Sholokhov, Alexey, primary, Kuzmin, Nikita, additional, Lee, Kong Aik, additional, and Chng, Eng Siong, additional
Published: 2023
Full Text: View/download PDF

96. Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification

Author: Liu, Meng, primary, Lee, Kong Aik, additional, Wang, Longbiao, additional, Zhang, Hanyi, additional, Zeng, Chang, additional, and Dang, Jianwu, additional
Published: 2023
Full Text: View/download PDF

97. Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification

Author: Wang, Qiongqiong, primary, Lee, Kong Aik, additional, and Liu, Tianchi, additional
Published: 2023
Full Text: View/download PDF

98. Introduction to Voice Presentation Attack Detection and Recent Advances

Author: Sahidullah, Md, primary, Delgado, Héctor, additional, Todisco, Massimiliano, additional, Kinnunen, Tomi, additional, Evans, Nicholas, additional, Yamagishi, Junichi, additional, and Lee, Kong-Aik, additional
Published: 2019
Full Text: View/download PDF

99. A Comparison of Categorical Attribute Data Clustering Methods

Author: Hautamäki, Ville, Pöllänen, Antti, Kinnunen, Tomi, Lee, Kong Aik, Li, Haizhou, Fränti, Pasi, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Kobsa, Alfred, Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Fränti, Pasi, editor, Brown, Gavin, editor, Loog, Marco, editor, Escolano, Francisco, editor, and Pelillo, Marcello, editor
Published: 2014
Full Text: View/download PDF

100. Unifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication

Author: Sizov, Aleksandr, Lee, Kong Aik, Kinnunen, Tomi, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Kobsa, Alfred, Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Fränti, Pasi, editor, Brown, Gavin, editor, Loog, Marco, editor, Escolano, Francisco, editor, and Pelillo, Marcello, editor
Published: 2014
Full Text: View/download PDF
