1. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech
- Author
-
Hirokazu Kameoka, Hsin-Te Hwang, Driss Matrouf, Markus Becker, Quan Wang, Sahidullah, Ye Jia, Yu Zhang, Lauri Juvela, Hsin-Min Wang, Wen-Chin Huang, Zhen-Hua Ling, Yuan Jiang, Yi-Chiao Wu, Héctor Delgado, Massimiliano Todisco, Yu Tsao, Li-Juan Liu, Junichi Yamagishi, Jean-François Bonastre, Tomoki Toda, Nicholas Evans, Robert A. J. Clark, Kai Onuma, Yu-Huai Peng, Sébastien Le Maguer, Avashna Govender, Takashi Kaneda, Andreas Nautsch, Kong Aik Lee, Xin Wang, Srikanth Ronanki, Ville Vestman, Koji Mushika, Ingmar Steiner, Tomi Kinnunen, Fergus Henderson, Jing-Xuan Zhang, Kou Tanaka, Paavo Alku
- Affiliations
-
National Institute of Informatics, University of Edinburgh (Centre for Speech Technology Research), EURECOM, Inria / LORIA / Université de Lorraine / CNRS (MULTISPEECH, Nancy - Grand Est), University of Eastern Finland, NEC Corporation, Aalto University, Academia Sinica, ADAPT Centre (Sigmedia Lab, Trinity College Dublin), Google, HOYA Corporation, iFlytek Research, Nagoya University, Nagoya City University, NTT Communication Science Laboratories (NTT Corporation), audEERING GmbH, Laboratoire Informatique d'Avignon (Avignon Université), University of Science and Technology of China, Hitotsubashi University, and Southern University of Science and Technology (SUSTech)
- Funding
-
The work was partially supported by JST CREST Grant No. JPMJCR18A6 (VoicePersonae project), Japan; MEXT KAKENHI Grant Nos. 16H06302, 16K16096, 17H04687, 18H04120, 18H04112 and 18KT0051, Japan; the VoicePersonae and RESPECT projects funded by the French Agence Nationale de la Recherche (ANR); the Academy of Finland (project no. 309629, 'NOTCH: NOn-cooperaTive speaker CHaracterization'); and Region Grand Est, France. The authors at the University of Eastern Finland also gratefully acknowledge the computational infrastructure of CSC – the IT Center for Science, and the support of the NVIDIA Corporation through the donation of a Titan V GPU used in this research. The numerical calculations for some of the spoofed data were carried out on the TSUBAME3.0 supercomputer at the Tokyo Institute of Technology. The ADAPT Centre (13/RC/2106) is funded by Science Foundation Ireland (SFI).
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS), Cryptography and Security (cs.CR), Signal Processing (eess.SP), automatic speaker verification, anti-spoofing, countermeasure, presentation attack, presentation attack detection, spoofing attack, speech synthesis, text-to-speech synthesis, voice conversion, replay, physical access, biometrics, media forensics, database design, ASVspoof challenge - Abstract
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and the same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than was previously possible. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of both spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment of spoofed data in the logical access scenario. The assessment demonstrates that the spoofed data in the ASVspoof 2019 database vary in perceived quality and similarity to the target speakers, and include spoofed utterances that cannot be differentiated from bona fide utterances even by human subjects.
Accepted for publication in Computer Speech and Language. This manuscript version is made available under the CC BY-NC-ND 4.0 license. For the published version on the Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114
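The abstract mentions baseline countermeasure (CM) results alongside the tandem detection cost function. As a simple illustration of how a standalone CM is typically scored, the sketch below estimates the equal error rate (EER, the challenge's secondary metric) from bona fide and spoof detection scores. It assumes the common convention that higher scores indicate bona fide speech; it is not the paper's primary t-DCF metric, and the threshold sweep is a minimal illustrative implementation rather than the official evaluation tool.

```python
def eer(bona_scores, spoof_scores):
    """Equal error rate of a standalone spoofing countermeasure.

    Illustrative sketch only: ASVspoof 2019's primary metric is the tandem
    detection cost function (t-DCF), which additionally accounts for a fixed
    ASV system; the EER here considers the countermeasure in isolation.
    Assumed score convention: higher score = more likely bona fide.
    """
    thresholds = sorted(bona_scores + spoof_scores)
    best = None
    for t in thresholds:
        p_miss = sum(s < t for s in bona_scores) / len(bona_scores)   # bona fide rejected
        p_fa = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # spoof accepted
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2)
    return best[1]

# Perfectly separated scores give an EER of 0.
print(eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0]))  # → 0.0
```

The EER is the operating point at which the miss rate (bona fide rejected) equals the false alarm rate (spoof accepted), so a single number summarizes the CM's discrimination ability independent of any chosen threshold.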
- Published
- 2020