Author: "Jung, Jee-weon" - Searchworks@Jio Institute Digital Library Search Results

1. ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

Author: Shi, Jiatong, Tian, Jinchuan, Wu, Yihan, Jung, Jee-weon, Yip, Jia Qi, Masuyama, Yoshiki, Chen, William, Wu, Yuning, Tang, Yuxun, Baali, Massa, Alharhi, Dareen, Zhang, Dong, Deng, Ruifan, Srivastava, Tejes, Wu, Haibin, Liu, Alexander H., Raj, Bhiksha, Jin, Qin, Song, Ruihua, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications., Comment: Accepted by SLT
Published: 2024

2. SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

Author: Jung, Jee-weon, Wu, Yihan, Wang, Xin, Kim, Ji-Hoon, Maiti, Soumi, Matsunaga, Yuta, Shim, Hye-jin, Tian, Jinchuan, Evans, Nicholas, Chung, Joon Son, Zhang, Wangyou, Um, Seyun, Takamichi, Shinnosuke, and Watanabe, Shinji
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, existing datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Existing SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. We present SpoofCeleb, which leverages a fully automated pipeline that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. The resulting SpoofCeleb dataset comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We provide baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb., Comment: 9 pages, 2 figures, 8 tables
Published: 2024

3. Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Author: Aldeneh, Zakaria, Higuchi, Takuya, Jung, Jee-weon, Chen, Li-Wei, Shum, Stephen, Abdelaziz, Ahmed Hussen, Watanabe, Shinji, Likhomanenko, Tatiana, and Theobald, Barry-John
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Iterative self-training, or iterative pseudo-labeling (IPL)--using an improved model from the current iteration to provide pseudo-labels for the next iteration--has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameters tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, which includes the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods., Comment: Submitted to ICASSP 2025
Published: 2024

4. Text-To-Speech Synthesis In The Wild

Author: Jung, Jee-weon, Zhang, Wangyou, Maiti, Soumi, Wu, Yihan, Wang, Xin, Kim, Ji-Hoon, Matsunaga, Yuta, Um, Seyun, Tian, Jinchuan, Shim, Hye-jin, Evans, Nicholas, Chung, Joon Son, Takamichi, Shinnosuke, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms. The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild. While this approach allows for the use of massive quantities of natural speech, until now, there are no common datasets. We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, in this case, applied to the VoxCeleb1 dataset commonly used for speaker recognition. We further propose two training sets. TITW-Hard is derived from the transcription, segmentation, and selection of VoxCeleb1 source data. TITW-Easy is derived from the additional application of enhancement and additional data selection based on DNSMOS. We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard. Both the dataset and protocols are publicly available and support the benchmarking of TTS systems trained using TITW data., Comment: 5 pages, submitted to ICASSP 2025 as a conference paper
Published: 2024

5. The VoxCeleb Speaker Recognition Challenge: A Retrospective

Author: Huh, Jaesung, Chung, Joon Son, Nagrani, Arsha, Brown, Andrew, Jung, Jee-weon, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges. Project page : https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/workshop.html, Comment: TASLP 2024
Published: 2024
Full Text: View/download PDF

6. ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Author: Wang, Xin, Delgado, Hector, Tak, Hemlata, Jung, Jee-weon, Shim, Hye-jin, Todisco, Massimiliano, Kukanov, Ivan, Liu, Xuechen, Sahidullah, Md, Kinnunen, Tomi, Evans, Nicholas, Lee, Kong Aik, and Yamagishi, Junichi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements., Comment: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)
Published: 2024

7. Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

Author: Shim, Hye-jin, Sahidullah, Md, Jung, Jee-weon, Watanabe, Shinji, and Kinnunen, Tomi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend class-wise interpretations beyond silence. We employ loss analysis and asymmetric methodologies to move away from traditional attack-focused and result-oriented evaluations towards a deeper examination of model behaviors. Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class., Comment: 5 pages, 1 figure, 5 tables, ISCA Interspeech 2024 SynData4GenAI Workshop
Published: 2024

8. Disentangled Representation Learning for Environment-agnostic Speaker Recognition

Author: Nam, KiHyun, Heo, Hee-Soo, Jung, Jee-weon, and Chung, Joon Son
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation - used as the refined embedding - condenses only the speaker characteristics. We show the versatility of our framework through its compatibility with any existing speaker embedding extractor, requiring no structural modifications or adaptations for integration. We validate the effectiveness of our framework by incorporating it into two popularly used embedding extractors and conducting experiments across various benchmarks. The results show a performance improvement of up to 16%. We release our code for this work to be available https://github.com/kaistmm/voxceleb-disentangler, Comment: Interspeech 2024. The official webpage can be found at https://mm.kaist.ac.kr/projects/voxceleb-disentangler/
Published: 2024

9. On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Author: Arora, Siddhant, Pasad, Ankita, Chien, Chung-Ming, Han, Jionghao, Sharma, Roshan, Jung, Jee-weon, Dhamyal, Hira, Chen, William, Shon, Suwon, Lee, Hung-yi, Livescu, Karen, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies., Comment: Accepted at ACL Findings 2024
Published: 2024

10. To what extent can ASV systems naturally defend against spoofing attacks?

Author: Jung, Jee-weon, Wang, Xin, Evans, Nicholas, Watanabe, Shinji, Shim, Hye-jin, Tak, Hemlata, Arora, Sidhhant, Yamagishi, Junichi, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically exploring diverse ASV systems and spoofing attacks, ranging from traditional to cutting-edge techniques. Through extensive analyses conducted on eight distinct ASV systems and 29 spoofing attack systems, we demonstrate that the evolution of ASV inherently incorporates defense mechanisms against spoofing attacks. Nevertheless, our findings also underscore that the advancement of spoofing attacks far outpaces that of ASV systems, hence necessitating further research on spoofing-robust ASV methodologies., Comment: 5 pages, 3 figures, 3 tables, Interspeech 2024
Published: 2024

11. Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Author: Zhang, Wangyou, Saijo, Kohei, Jung, Jee-weon, Li, Chenda, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights for addressing the above issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and other tasks such as speech recognition. These findings further provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures., Comment: 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024
Published: 2024
Full Text: View/download PDF

12. a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Author: Shim, Hye-jin, Jung, Jee-weon, Kinnunen, Tomi, Evans, Nicholas, Bonastre, Jean-Francois, and Lapidot, Itshak
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to combine speaker and spoof detectors. In this paper, we propose an architecture-agnostic detection cost function (a-DCF). A generalisation of the original DCF used widely for the assessment of automatic speaker verification (ASV), the a-DCF is designed for the evaluation of spoofing-robust ASV. Like the DCF, the a-DCF reflects the cost of decisions in a Bayes risk sense, with explicitly defined class priors and detection cost model. We demonstrate the merit of the a-DCF through the benchmarking evaluation of architecturally-heterogeneous spoofing-robust ASV solutions., Comment: 8 pages, submitted to Speaker Odyssey 2024
Published: 2024

13. TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Author: Kim, Minsu, Jung, Jee-weon, Rha, Hyeongseop, Maiti, Soumi, Arora, Siddhant, Chang, Xuankai, Watanabe, Shinji, and Ro, Yong Man
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single model counterparts consistently, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
Published: 2024

14. Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

Author: Aldeneh, Zakaria, Higuchi, Takuya, Jung, Jee-weon, Seto, Skyler, Likhomanenko, Tatiana, Shum, Stephen, Abdelaziz, Ahmed Hussen, Watanabe, Shinji, and Theobald, Barry-John
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the design of the downstream model for speaker verification using self-supervised features. We show that we can simplify the model to use 97.51% fewer parameters while achieving a 29.93% average improvement in performance on SUPERB. Consequently, we show that the simplified downstream model is more data efficient compared to baseline--it achieves better performance with only 60% of the training data.
Published: 2024

15. ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

Author: Jung, Jee-weon, Zhang, Wangyou, Shi, Jiatong, Aldeneh, Zakaria, Higuchi, Takuya, Theobald, Barry-John, Abdelaziz, Ahmed Hussen, and Watanabe, Shinji
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, facilitating the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN., Comment: 5 pages, 3 figures, 7 tables, Interspeech 2024
Published: 2024

16. OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Author: Peng, Yifan, Tian, Jinchuan, Chen, William, Arora, Siddhant, Yan, Brian, Sudo, Yui, Shakeel, Muhammad, Choi, Kwanghee, Shi, Jiatong, Chang, Xuankai, Jung, Jee-weon, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, in most evaluation benchmarks, while showing an improved inference speed of up to 25%. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with low license restrictions. We will publicly release the code, pre-trained models, and training logs., Comment: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/
Published: 2024

17. Improving Design of Input Condition Invariant Speech Enhancement

Author: Zhang, Wangyou, Jung, Jee-weon, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed showing promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and computational costs compared to the existing method. All proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance on real conditions. Recipe with full model details is released at https://github.com/espnet/espnet., Comment: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2)
Published: 2024

18. AugSumm: towards generalizable speech summarization using synthetic labels from large language model

Author: Jung, Jee-weon, Sharma, Roshan, Chen, William, Raj, Bhiksha, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary for every recording. Generating multiple human references would be ideal to better represent the distribution statistically, but is impractical because annotation is expensive. We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation. First, we explore prompting strategies to generate synthetic summaries from ChatGPT. We validate the quality of synthetic summaries using multiple metrics including human evaluation, where we find that summaries generated using AugSumm are perceived as more valid to humans. Second, we develop methods to utilize synthetic summaries in training and evaluation. Experiments on How2 demonstrate that pre-training on synthetic summaries and fine-tuning on GT summaries improves ROUGE-L by 1 point on both GT and AugSumm-based test sets. AugSumm summaries are available at https://github.com/Jungjee/AugSumm., Comment: This work has been submitted to the IEEE ICASSP for possible publication. 5 pages
Published: 2024

19. Understanding Probe Behaviors through Variational Bounds of Mutual Information

Author: Choi, Kwanghee, Jung, Jee-weon, and Watanabe, Shinji
Subjects: Computer Science - Information Theory, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that the intermediate representations can have the biggest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation." We also compare accuracy with MI as the measuring criteria. Finally, we empirically validate our claims by observing the self-supervised speech models on retaining word and phoneme information., Comment: Accepted to ICASSP 2024, implementation available at https://github.com/juice500ml/information_probing
Published: 2023

20. UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

Author: Arora, Siddhant, Futami, Hayato, Jung, Jee-weon, Peng, Yifan, Sharma, Roshan, Kashiwagi, Yosuke, Tsunoo, Emiru, Livescu, Karen, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess the zero-shot capabilities, finding that the model generalizes to new datasets and languages for seen task types., Comment: Accepted at NAACL 2024
Published: 2023

21. One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Author: Cornell, Samuele, Jung, Jee-weon, Watanabe, Shinji, and Squartini, Stefano
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and ``Whisper-style" prompting. The local outputs are then combined to get the final SD+ASR result by clustering the speaker embeddings to get global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
Published: 2023

22. Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Author: Wu, Shih-Lun, Chang, Xuankai, Wichern, Gordon, Jung, Jee-weon, Germain, François, Roux, Jonathan Le, and Watanabe, Shinji
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of training data. During inference, we propose to employ nucleus sampling and a hybrid reranking algorithm, which has not been explored in AAC research. Combining our efforts, our model achieves a new state-of-the-art 32.6 SPIDEr-FL score on the Clotho evaluation split, and wins the 2023 DCASE AAC challenge., Comment: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)
Published: 2023

23. Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Author: Peng, Yifan, Tian, Jinchuan, Yan, Brian, Berrebbi, Dan, Chang, Xuankai, Li, Xinjian, Shi, Jiatong, Arora, Siddhant, Chen, William, Sharma, Roshan, Zhang, Wangyou, Sudo, Yui, Shakeel, Muhammad, Jung, Jee-weon, Maiti, Soumi, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science., Comment: Accepted at ASRU 2023
Published: 2023

24. Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

Author: Maiti, Soumi, Peng, Yifan, Choi, Shukjae, Jung, Jee-weon, Chang, Xuankai, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. Further, VoxtLM is trained with publicly available data and training recipes and model checkpoints are open-sourced to make fully reproducible work.
Published: 2023

25. Encoder-decoder multimodal speaker change detection

Author: Jung, Jee-weon, Seo, Soonshin, Heo, Hee-Soo, Kim, Geonmin, Kim, You Jin, Kwon, Young-ki, Lee, Minjae, and Lee, Bong-Jin
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model are built upon two main proposals, a novel mechanism for modality fusion and the adoption of a encoder-decoder architecture. Different to previous MMSCD works that extract speaker embeddings from extremely short audio segments, aligned to a single word, we use a speaker embedding extracted from 1.5s. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription., Comment: 5 pages, accepted for presentation at INTERSPEECH 2023
Published: 2023

26. Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Author: Shim, Hye-jin, Jung, Jee-weon, and Kinnunen, Tomi
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio anti-spoofing for automatic speaker verification aims to safeguard users' identities from spoofing attacks. Although state-of-the-art spoofing countermeasure(CM) models perform well on specific datasets, they lack generalization when evaluated with different datasets. To address this limitation, previous studies have explored large pre-trained models, which require significant resources and time. We aim to develop a compact but well-generalizing CM model that can compete with large pre-trained models. Our approach involves multi-dataset co-training and sharpness-aware minimization, which has not been investigated in this domain. Extensive experiments reveal that proposed method yield competitive results across various datasets while utilizing 4,000 times less parameters than the large pre-trained models., Comment: Interspeech 2023
Published: 2023

27. Towards single integrated spoofing-aware speaker verification embeddings

Author: Mun, Sung Hwan, Shim, Hye-jin, Tak, Hemlata, Wang, Xin, Liu, Xuechen, Sahidullah, Md, Jeong, Myeonghun, Han, Min Hyun, Todisco, Massimiliano, Lee, Kong Aik, Yamagishi, Junichi, Evans, Nicholas, Kinnunen, Tomi, Kim, Nam Soo, and Jung, Jee-weon
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data and distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge., Comment: Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline
Published: 2023

28. VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

Author: Huh, Jaesung, Brown, Andrew, Jung, Jee-weon, Chung, Joon Son, Nagrani, Arsha, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
Published: 2023

29. Absolute decision corrupts absolutely: conservative online speaker diarisation

Author: Kwon, Youngki, Heo, Hee-Soo, Lee, Bong-Jin, Kim, You Jin, and Jung, Jee-weon
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Our focus lies in developing an online speaker diarisation framework which demonstrates robust performance across diverse domains. In online speaker diarisation, outputs generated in real-time are irreversible, and a few misjudgements in the early phase of an input session can lead to catastrophic results. We hypothesise that cautiously increasing the number of estimated speakers is of paramount importance among many other factors. Thus, our proposed framework includes decreasing the number of speakers by one when the system judges that an increase in the past was faulty. We also adopt dual buffers, checkpoints and centroids, where checkpoints are combined with silhouette coefficients to estimate the number of speakers and centroids represent speakers. Again, we believe that more than one centroid can be generated from one speaker. Thus we design a clustering-based label matching technique to assign labels in real-time. The resulting system is lightweight yet surprisingly effective. The system demonstrates state-of-the-art performance on DIHARD 2 and 3 datasets, where it is also competitive in AMI and VoxConverse test sets., Comment: 5pages, 2 figure, 4 tables, submitted to ICASSP
Published: 2022

30. High-resolution embedding extractor for speaker diarisation

Author: Heo, Hee-Soo, Kwon, Youngki, Lee, Bong-Jin, Kim, You Jin, and Jung, Jee-weon
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker embedding extractors significantly influence the performance of clustering-based speaker diarisation systems. Conventionally, only one embedding is extracted from each speech segment. However, because of the sliding window approach, a segment easily includes two or more speakers owing to speaker change points. This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE), which extracts multiple high-resolution embeddings from each speech segment. Hee consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success. The enhancer of HEE replaces the aggregation process; instead of a global pooling layer, the enhancer combines relative information to each frame via attention leveraging the global context. Extracted dense frame-level embeddings can each represent a speaker. Thus, multiple speakers can be represented by different frame-level features in each segment. We also propose an artificially generating mixture data training framework to train the proposed HEE. Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set, except for one dataset, which we analyse that rapid speaker changes less exist., Comment: 5pages, 2 figure, 3 tables, submitted to ICASSP
Published: 2022

31. Disentangled representation learning for multilingual speaker recognition

Author: Nam, Kihyun, Kim, Youkyum, Huh, Jaesung, Heo, Hee Soo, Jung, Jee-weon, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The goal of this paper is to learn robust speaker representation for bilingual speaking scenario. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when speaking in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish a large-scale evaluation set named VoxCeleb1-B derived from VoxCeleb that considers bilingual scenarios. We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from speaker representation while ensuring stable speaker representation learning. Our language-disentangled learning method only uses language pseudo-labels without manual information., Comment: Interspeech 2023
Published: 2022

32. In search of strong embedding extractors for speaker diarisation

Author: Jung, Jee-weon, Heo, Hee-Soo, Lee, Bong-Jin, Huh, Jaesung, Brown, Andrew, Kwon, Youngki, Watanabe, Shinji, and Chung, Joon Son
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation scenario better. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input. One technique generates overlapped speech segments, and the other generates segments where two speakers utter sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective., Comment: 5pages, 1 figure, 2 tables, submitted to ICASSP
Published: 2022

33. Large-scale learning of generalised representations for speaker recognition

Author: Jung, Jee-weon, Heo, Hee-Soo, Lee, Bong-Jin, Lee, Jaesong, Shim, Hye-jin, Kwon, Youngki, Chung, Joon Son, and Watanabe, Shinji
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The objective of this work is to develop a speaker recognition model to be used in diverse scenarios. We hypothesise that two components should be adequately configured to build such a model. First, adequate architecture would be required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data would be required. We investigate several new training data configurations combining a few existing datasets. The most extensive configuration includes over 87k speakers' 10.22k hours of speech. Four evaluation protocols are adopted to measure how the trained model performs in diverse scenarios. Through experiments, we find that MFA-Conformer with the least inductive bias generalises the best. We also show that training with proposed large data configurations gives better performance. A boost in generalisation is observed, where the average performance on four evaluation protocols improves by more than 20%. In addition, we also demonstrate that these models' performances can improve even further when increasing capacity., Comment: 5pages, 5 tables, submitted to ICASSP
Published: 2022

34. Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion

Author: Shim, Hye-jin, Tak, Hemlata, Liu, Xuechen, Heo, Hee-Soo, Jung, Jee-weon, Chung, Joon Son, Chung, Soo-Whan, Yu, Ha-Jin, Lee, Bong-Jin, Todisco, Massimiliano, Delgado, Héctor, Lee, Kong Aik, Sahidullah, Md, Kinnunen, Tomi, and Evans, Nicholas
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning has brought impressive progress in the study of both automatic speaker verification (ASV) and spoofing countermeasures (CM). Although solutions are mutually dependent, they have typically evolved as standalone sub-systems whereby CM solutions are usually designed for a fixed ASV system. The work reported in this paper aims to gauge the improvements in reliability that can be gained from their closer integration. Results derived using the popular ASVspoof2019 dataset indicate that the equal error rate (EER) of a state-of-the-art ASV system degrades from 1.63% to 23.83% when the evaluation protocol is extended with spoofed trials.%subjected to spoofing attacks. However, even the straightforward integration of ASV and CM systems in the form of score-sum and deep neural network-based fusion strategies reduce the EER to 1.71% and 6.37%, respectively. The new Spoofing-Aware Speaker Verification (SASV) challenge has been formed to encourage greater attention to the integration of ASV and CM systems as well as to provide a means to benchmark different solutions., Comment: 8 pages, accepted by Odyssey 2022
Published: 2022

35. Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification

Author: Mun, Sung Hwan, Jung, Jee-weon, Han, Min Hyun, and Kim, Nam Soo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: The majority of recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention mechanisms. Convolutional layers of these models typically have a fixed kernel size, e.g., 3 or 5. In this study, we further contribute to this line of research utilising a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select the kernel size in a data-driven fashion. It is based on an attention mechanism which exploits both frequency and channel domain. We first apply existing SKA module to our baseline. Then we propose two SKA variants where the first variant is applied in front of the ECAPA-TDNN model and the other is combined with the Res2net backbone block. Through extensive experiments, we demonstrate that our two proposed SKA variants consistently improves the performance and are complementary when tested on three different evaluation protocols., Comment: Accepted by IEEE SLT 2022. 7 pages, 4 figures, 1 table. Code is available at https://github.com/msh9184/ska-tdnn.git
Published: 2022

36. SASV 2022: The First Spoofing-Aware Speaker Verification Challenge

Author: Jung, Jee-weon, Tak, Hemlata, Shim, Hye-jin, Heo, Hee-Soo, Lee, Bong-Jin, Chung, Soo-Whan, Yu, Ha-Jin, Evans, Nicholas, and Kinnunen, Tomi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The first spoofing-aware speaker verification (SASV) challenge aims to integrate research efforts in speaker verification and anti-spoofing. We extend the speaker verification scenario by introducing spoofed trials to the usual set of target and impostor trials. In contrast to the established ASVspoof challenge where the focus is upon separate, independently optimised spoofing detection and speaker verification sub-systems, SASV targets the development of integrated and jointly optimised solutions. Pre-trained spoofing detection and speaker verification models are provided as open source and are used in two baseline SASV solutions. Both models and baselines are freely available to participants and can be used to develop back-end fusion approaches or end-to-end solutions. Using the provided common evaluation protocol, 23 teams submitted SASV solutions. When assessed with target, bona fide non-target and spoofed non-target trials, the top-performing system reduces the equal error rate of a conventional speaker verification system from 23.83% to 0.13%. SASV challenge results are a testament to the reliability of today's state-of-the-art approaches to spoofing detection and speaker verification., Comment: 5 pages, 2 figures, 2 tables, submitted to Interspeech 2022 as a conference paper
Published: 2022

37. Curriculum learning for self-supervised speaker verification

Author: Heo, Hee-Soo, Jung, Jee-weon, Kang, Jingu, Kwon, Youngki, Kim, You Jin, Lee, Bong-Jin, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The goal of this paper is to train effective self-supervised speaker representations without identity labels. We propose two curriculum learning strategies within a self-supervised learning framework. The first strategy aims to gradually increase the number of speakers in the training phase by enlarging the used portion of the train dataset. The second strategy applies various data augmentations to more utterances within a mini-batch as the training proceeds. A range of experiments conducted using the DINO self-supervised framework on the VoxCeleb1 evaluation protocol demonstrates the effectiveness of our proposed curriculum learning strategies. We report a competitive equal error rate of 4.47% with a single-phase training, and we also demonstrate that the performance further improves to 1.84% by fine-tuning on a small labelled dataset., Comment: INTERSPEECH 2023. 5 pages, 3 figures, 4 tables
Published: 2022

38. Pushing the limits of raw waveform speaker recognition

Author: Jung, Jee-weon, Kim, You Jin, Heo, Hee-Soo, Lee, Bong-Jin, Kwon, Youngki, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems are typically inferior to the state-of-the-art handcrafted feature-based counterparts, which demonstrate equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs. The model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbone module and multi-layer feature aggregation. Our best model achieves an equal error rate of 0.89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin. We also explore the application of the proposed model in the context of self-supervised learning framework. Our self-supervised model outperforms single phase-based existing works in this line of research. Finally, we show that self-supervised pre-training is effective for the semi-supervised scenario where we only have a small set of labelled training data, along with a larger set of unlabelled examples., Comment: submitted to INTERSPEECH 2022 as a conference paper. 5 pages, 2 figures, 5 tables
Published: 2022

39. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

Author: Tak, Hemlata, Todisco, Massimiliano, Wang, Xin, Jung, Jee-weon, Yamagishi, Junichi, and Evans, Nicholas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The performance of spoofing countermeasure systems depends fundamentally upon the use of sufficiently representative training data. With this usually being limited, current solutions typically lack generalisation to attacks encountered in the wild. Strategies to improve reliability in the face of uncontrolled, unpredictable attacks are hence needed. We report in this paper our efforts to use self-supervised learning in the form of a wav2vec 2.0 front-end with fine tuning. Despite initial base representations being learned using only bona fide data and no spoofed data, we obtain the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases. When combined with data augmentation,these results correspond to an improvement of almost 90% relative to our baseline system., Comment: Submitted to Speaker Odyssey Workshop 2022
Published: 2022

40. SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan

Author: Jung, Jee-weon, Tak, Hemlata, Shim, Hye-jin, Heo, Hee-Soo, Lee, Bong-Jin, Chung, Soo-Whan, Kang, Hong-Goo, Yu, Ha-Jin, Evans, Nicholas, and Kinnunen, Tomi
Subjects: Computer Science - Sound, Computer Science - Cryptography and Security, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: ASV (automatic speaker verification) systems are intrinsically required to reject both non-target (e.g., voice uttered by different speaker) and spoofed (e.g., synthesised or converted) inputs. However, there is little consideration for how ASV systems themselves should be adapted when they are expected to encounter spoofing attacks, nor when they operate in tandem with CMs (spoofing countermeasures), much less how both systems should be jointly optimised. The goal of the first SASV (spoofing-aware speaker verification) challenge, a special sesscion in ISCA INTERSPEECH 2022, is to promote development of integrated systems that can perform ASV and CM simultaneously., Comment: Evaluation plan of the SASV Challenge 2022. See this webpage for more information: https://sasv-challenge.github.io
Published: 2022

41. Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity

Author: Kim, You Jin, Heo, Hee-Soo, Jung, Jee-weon, Kwon, Youngki, Lee, Bong-Jin, and Chung, Joon Son
Subjects: Computer Science - Sound, Computer Science - Computation and Language
Abstract: The objective of this work is to train noise-robust speaker embeddings adapted for speaker diarisation. Speaker embeddings play a crucial role in the performance of diarisation systems, but they often capture spurious information such as noise, adversely affecting performance. Our previous work has proposed an auto-encoder-based dimensionality reduction module to help remove the redundant information. However, they do not explicitly separate such information and have also been found to be sensitive to hyper-parameter values. To this end, we propose two contributions to overcome these issues: (i) a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings; (ii) the use of speech activity vector to prevent the speaker code from representing the background noise. Through a range of experiments conducted on four datasets, our approach consistently demonstrates the state-of-the-art performance among models without system fusion., Comment: This paper was submitted to ICASSP 2023
Published: 2021

42. Multi-scale speaker embedding-based graph attention networks for speaker diarisation

Author: Kwon, Youngki, Heo, Hee-Soo, Jung, Jee-weon, Kim, You Jin, Lee, Bong-Jin, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: The objective of this work is effective speaker diarisation using multi-scale speaker embeddings. Typically, there is a trade-off between the ability to recognise short speaker segments and the discriminative power of the embedding, according to the segment length used for embedding extraction. To this end, recent works have proposed the use of multi-scale embeddings where segments with varying lengths are used. However, the scores are combined using a weighted summation scheme where the weights are fixed after the training phase, whereas the importance of segment lengths can differ with in a single session. To address this issue, we present three key contributions in this paper: (1) we propose graph attention networks for multi-scale speaker diarisation; (2) we design scale indicators to utilise scale information of each embedding; (3) we adapt the attention-based aggregation to utilise a pre-computed affinity matrix from multi-scale embeddings. We demonstrate the effectiveness of our method in various datasets where the speaker confusion which constitutes the primary metric drops over 10% in average relative compared to the baseline., Comment: 5 pages, 2 figures, submitted to ICASSP as a conference paper
Published: 2021

43. AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks

Author: Jung, Jee-weon, Heo, Hee-Soo, Tak, Hemlata, Shim, Hye-jin, Chung, Joon Son, Lee, Bong-Jin, Yu, Ha-Jin, and Evans, Nicholas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Artefacts that differentiate spoofed from bona-fide utterances can reside in spectral or temporal domains. Their reliable detection usually depends upon computationally demanding ensemble systems where each subsystem is tuned to some specific artefacts. We seek to develop an efficient, single system that can detect a broad range of different spoofing attacks without score-level ensembles. We propose a novel heterogeneous stacking graph attention layer which models artefacts spanning heterogeneous temporal and spectral domains with a heterogeneous attention mechanism and a stack node. With a new max graph operation that involves a competitive mechanism and an extended readout scheme, our approach, named AASIST, outperforms the current state-of-the-art by 20% relative. Even a lightweight variant, AASIST-L, with only 85K parameters, outperforms all competing systems., Comment: 5 pages, 1 figure, 3 tables, submitted to ICASSP2022
Published: 2021

44. End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Author: Tak, Hemlata, Jung, Jee-weon, Patino, Jose, Kamble, Madhu, Todisco, Massimiliano, and Evans, Nicholas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific subbands and temporal segments. Various approaches can be used to capture and model such artefacts, however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06 % for the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open source implementation., Comment: Accepted in ASVspoof 2021 Workshop
Published: 2021

45. Attentive max feature map and joint training for acoustic scene classification

Author: Shim, Hye-jin, Jung, Jee-weon, Kim, Ju-ho, and Yu, Ha-Jin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Various attention mechanisms are being widely applied to acoustic scene classification. However, we empirically found that the attention mechanism can excessively discard potentially valuable information, despite improving performance. We propose the attentive max feature map that combines two effective techniques, attention and a max feature map, to further elaborate the attention mechanism and mitigate the above-mentioned phenomenon. We also explore various joint training methods, including multi-task learning, that allocate additional abstract labels for each audio recording. Our proposed system demonstrates state-of-the-art performance for single systems on Subtask A of the DCASE 2020 challenge by applying the two proposed techniques using relatively fewer parameters. Furthermore, adopting the proposed attentive max feature map, our team placed fourth in the recent DCASE 2021 challenge., Comment: 5 pages, 2 figures, 5 tables, submitted to ICASSP 2022
Published: 2021

46. Learning Metrics from Mean Teacher: A Supervised Learning Method for Improving the Generalization of Speaker Verification System

Author: Kim, Ju-ho, Shim, Hye-jin, Jung, Jee-weon, and Yu, Ha-Jin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Most speaker verification tasks are studied as an open-set evaluation scenario considering the real-world condition. Thus, the generalization power to unseen speakers is of paramount important to the performance of the speaker verification system. We propose to apply \textit {Mean Teacher}, a temporal averaging model, to extract speaker embeddings with small intra-class variance and large inter-class variance. The mean teacher network refers to the temporal averaging of deep neural network parameters; it can produces more accurate and stable representations than using weights after the training finished. By learning the reliable intermediate representation of the mean teacher network, we expect that the proposed method can explore more discriminatory embedding spaces and improve the generalization performance of the speaker verification system. Experimental results on the VoxCeleb1 test set demonstrate that the proposed method relatively improves performance by 11.61\%, compared to a baseline system., Comment: 5 pages, 1 figures, 5 tables, submitted to 2021 Interspeech as a conference paper
Published: 2021

47. Graph Attention Networks for Anti-Spoofing

Author: Tak, Hemlata, Jung, Jee-weon, Patino, Jose, Todisco, Massimiliano, and Evans, Nicholas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound
Abstract: The cues needed to detect spoofing attacks against automatic speaker verification are often located in specific spectral sub-bands or temporal segments. Previous works show the potential to learn these using either spectral or temporal self-attention mechanisms but not the relationships between neighbouring sub-bands or segments. This paper reports our use of graph attention networks (GATs) to model these relationships and to improve spoofing detection performance. GATs leverage a self-attention mechanism over graph structured data to model the data manifold and the relationships between nodes. Our graph is constructed from representations produced by a ResNet. Nodes in the graph represent information either in specific sub-bands or temporal segments. Experiments performed on the ASVspoof 2019 logical access database show that our GAT-based model with temporal attention outperforms all of our baseline single systems. Furthermore, GAT-based systems are complementary to a set of existing systems. The fusion of GAT-based models with more conventional countermeasures delivers a 47% relative improvement in performance compared to the best performing single GAT system., Comment: Submitted to INTERSPEECH 2021
Published: 2021

48. Adapting Speaker Embeddings for Speaker Diarisation

Author: Kwon, Youngki, Jung, Jee-weon, Heo, Hee-Soo, Kim, You Jin, Lee, Bong-Jin, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system achieving an average relative improvement of 25.07% in terms of diarisation error rate over the baseline., Comment: 5 pages, 2 figures, 3 tables, submitted to Interspeech as a conference paper
Published: 2021

49. Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network

Author: Jung, Jee-weon, Heo, Hee-Soo, Kwon, Youngki, Chung, Joon Son, and Lee, Bong-Jin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: In this work, we propose an overlapped speech detection system trained as a three-class classifier. Unlike conventional systems that perform binary classification as to whether or not a frame contains overlapped speech, the proposed approach classifies into three classes: non-speech, single speaker speech, and overlapped speech. By training a network with the more detailed label definition, the model can learn a better notion on deciding the number of speakers included in a given frame. A convolutional recurrent neural network architecture is explored to benefit from both convolutional layer's capability to model local patterns and recurrent layer's ability to model sequential information. The proposed overlapped speech detection model establishes a state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set, showing a 20% increase in recall along with higher precision. In addition, we also introduce a simple approach to utilize the proposed overlapped speech detection model for speaker diarization which ranked third place in the Track 1 of the DIHARD III challenge., Comment: 5 pages, 2 figures, 4 tables, submitted to Interspeech as a conference paper
Published: 2021

50. Graph Attention Networks for Speaker Verification

Author: Jung, Jee-weon, Heo, Hee-Soo, Yu, Ha-Jin, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: This work presents a novel back-end framework for speaker verification using graph attention networks. Segment-wise speaker embeddings extracted from multiple crops within an utterance are interpreted as node representations of a graph. The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score. We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks. After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using affine transform, followed by a readout operation resulting in a scalar similarity score. To enable successful adaptation for speaker verification, we propose techniques such as separating trainable weights for attention map calculations between segment-wise speaker embeddings from different utterances. The effectiveness of the proposed framework is validated using three different speaker embedding extractors trained with different architectures and objective functions. Experimental results demonstrate consistent improvement over various baseline back-end classifiers, with an average equal error rate improvement of 20% over the cosine similarity back-end without test time augmentation., Comment: 5 pages, 1 figure, 2 tables, accepted for presentation at ICASSP 2021 as a conference paper
Published: 2020

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

196 results on '"Jung, Jee-weon"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources