Authors: Hirokazu Kameoka, Hsin-Te Hwang, Driss Matrouf, Markus Becker, Quan Wang, Sahidullah, Ye Jia, Yu Zhang, Lauri Juvela, Hsin-Min Wang, Wen-Chin Huang, Zhen-Hua Ling, Yuan Jiang, Yi-Chiao Wu, Héctor Delgado, Massimiliano Todisco, Yu Tsao, Li-Juan Liu, Junichi Yamagishi, Jean-François Bonastre, Tomoki Toda, Nicholas Evans, Robert A. J. Clark, Kai Onuma, Yu-Huai Peng, Sébastien Le Maguer, Avashna Govender, Takashi Kaneda, Andreas Nautsch, Kong Aik Lee, Xin Wang, Srikanth Ronanki, Ville Vestman, Koji Mushika, Ingmar Steiner, Tomi Kinnunen, Fergus Henderson, Jing-Xuan Zhang, Kou Tanaka, Paavo Alku

Affiliations: Hitotsubashi University; University of Edinburgh; EURECOM [Sophia Antipolis]; Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Université de Lorraine (UL), Centre National de la Recherche Scientifique (CNRS); University of Eastern Finland; NEC Corporation; Aalto University; Academia Sinica; ADAPT Centre, Sigmedia Lab, EE Engineering, Trinity College Dublin; Google Inc [Mountain View]; Hoya Corporation; iFlytek Research; Nagoya City University [Nagoya, Japan]; Nagoya University; NTT Communication Science Laboratories, NTT Corporation; audEERING GmbH; Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU), Centre d'Enseignement et de Recherche en Informatique (CERI); The Centre for Speech Technology Research [Edinburgh] (CSTR); National Institute of Informatics; University of Science and Technology of China; Southern University of Science and Technology (SUSTech)

Acknowledgments: This work was partially supported by JST CREST Grant No. JPMJCR18A6 (VoicePersonae project), Japan; MEXT KAKENHI Grant Nos. 16H06302, 16K16096, 17H04687, 18H04120, 18H04112, and 18KT0051, Japan; the VoicePersonae and RESPECT projects funded by the French Agence Nationale de la Recherche (ANR); the Academy of Finland (project no. 309629, entitled "NOTCH: NOn-cooperaTive speaker CHaracterization"); and Region Grand Est, France. The authors at the University of Eastern Finland gratefully acknowledge the use of the computational infrastructure at CSC – the IT Center for Science, and the support of NVIDIA Corporation through the donation of a Titan V GPU used in this research. The numerical calculations for some of the spoofed data were carried out on the TSUBAME3.0 supercomputer at the Tokyo Institute of Technology. The ADAPT Centre (13/RC/2106) is funded by Science Foundation Ireland (SFI).