Author: "Tan, Zheng-Hua" / Search Limiters: Academic (Peer-Reviewed) Journals - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Tan, Zheng-Hua"' showing total 648 results

1. BiSSL: Bilevel Optimization for Self-Supervised Pre-Training and Fine-Tuning

Author: Zakarias, Gustav Wagner, Hansen, Lars Kai, and Tan, Zheng-Hua
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.
Published: 2024

2. Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models

Author: Kühne, Nikolai L., Kitchen, Astrid H. F., Jensen, Marie S., Brøndt, Mikkel S. L., Gonzalez, Martin, Biscio, Christophe, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic speech recognition (ASR) systems are known to be vulnerable to adversarial attacks. This paper addresses detection and defence against targeted white-box attacks on speech signals for ASR systems. While existing work has utilised diffusion models (DMs) to purify adversarial examples, achieving state-of-the-art results in keyword spotting tasks, their effectiveness for more complex tasks such as sentence-level ASR remains unexplored. Additionally, the impact of the number of forward diffusion steps on performance is not well understood. In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps. Through comprehensive experiments on the Mozilla Common Voice dataset, we demonstrate that two forward diffusion steps can completely defend against adversarial attacks on sentences. Moreover, we introduce a novel, training-free approach for detecting adversarial attacks by leveraging a pre-trained DM. Our experimental results show that this method can detect adversarial attacks with high accuracy.
Comment: Under review at ICASSP 2025
Published: 2024

3. Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Author: Xie, Yuying, Kuhlmann, Michael, Rautenberg, Frederik, Tan, Zheng-Hua, and Haeb-Umbach, Reinhold
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: Speech signals encompass various information across multiple levels including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming speaker information to be temporally more stable than content-induced variations. However, this assumption may introduce other temporally stable information into the speaker embeddings, like environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without the necessity for style labels. Experimental results validate the proposed method's effectiveness in extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.
Comment: Accepted by EUSIPCO 2024
Published: 2024

4. Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs

Author: Yadav, Sarthak, Theodoridis, Sergios, and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have also observed a lot of renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM architecture. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not yet been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach to learn audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.
Comment: Under review at ICASSP 2025. arXiv admin note: text overlap with arXiv:2406.02178
Published: 2024

5. Zero-Shot Audio Captioning Using Soft and Hard Prompts

Author: Zhang, Yiming, Xu, Xuenan, Du, Ruoyi, Liu, Haohe, Dong, Yuan, Tan, Zheng-Hua, Wang, Wenwu, and Ma, Zhanyu
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
Published: 2024

6. The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

Author: Gonzalez, Philippe, Tan, Zheng-Hua, Østergaard, Jan, Jensen, Jesper, Alstrøm, Tommy Sonne, and May, Tobias
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown. Moreover, studies that investigated the effect of training dataset size did not control for the data diversity. It is thus unclear whether the performance improvement was due to the increased dataset size or diversity. Therefore, we systematically investigate the effect of training dataset size on the performance of popular state-of-the-art discriminative and diffusion-based speech enhancement systems in matched conditions. We control for the data diversity by using a fixed set of speech utterances, noise segments and binaural room impulse responses to generate datasets of different sizes. We find that the diffusion-based systems perform the best relative to the discriminative systems in terms of objective metrics with datasets of 10 h or less. However, their objective metrics performance does not improve when increasing the training dataset size as much as the discriminative systems, and they are outperformed by the discriminative systems with datasets of 100 h or more.
Comment: Accepted version
Published: 2024

7. Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations

Author: Yadav, Sarthak and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence length and model size comparisons.
Comment: Accepted at INTERSPEECH 2024
Published: 2024

8. Noise-Robust Keyword Spotting through Self-supervised Pretraining

Author: Mørk, Jacob, Bovbjerg, Holger Severin, Kiss, Gergely, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, 68T10, I.2.6
Abstract: Voice assistants are now widely available, and to activate them a keyword spotting (KWS) algorithm is used. Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve a good performance. Leveraging unlabelled data through self-supervised learning (SSL) has been shown to increase the accuracy in clean conditions. This paper explores how SSL pretraining such as Data2Vec can be used to enhance the robustness of KWS models in noisy conditions, which is under-explored. Models of three different sizes are pretrained using different pretraining approaches and then fine-tuned for KWS. These models are then tested and compared to models trained using two baseline supervised learning methods, one being standard training using clean data and the other one being multi-style training (MTR). The results show that pretraining and fine-tuning on clean data is superior to supervised learning on clean data across all testing conditions, and superior to supervised MTR for testing conditions of SNR above 5 dB. This indicates that pretraining alone can increase the model's robustness. Finally, it is found that using noisy data for pretraining models, especially with the Data2Vec-denoising approach, significantly enhances the robustness of KWS models in noisy conditions.
Published: 2024

9. How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses

Author: Leer, Peter, Jensen, Jesper, Tan, Zheng-Hua, Østergaard, Jan, and Bramsløw, Lars
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Advanced auditory models are useful in designing signal-processing algorithms for hearing-loss compensation or speech enhancement. Such auditory models provide rich and detailed descriptions of the auditory pathway, and might allow for individualization of signal-processing strategies, based on physiological measurements. However, these auditory models are often computationally demanding, requiring significant time to compute. To address this issue, previous studies have explored the use of deep neural networks to emulate auditory models and reduce inference time. While these deep neural networks offer impressive efficiency gains in terms of computational time, they may suffer from uneven emulation performance as a function of auditory-model frequency-channels and input sound pressure level, making them unsuitable for many tasks. In this study, we demonstrate that the conventional machine-learning optimization objective used in existing state-of-the-art methods is the primary source of this limitation. Specifically, the optimization objective fails to account for the frequency- and level-dependencies of the auditory model, caused by a large input dynamic range and different types of hearing losses emulated by the auditory model. To overcome this limitation, we propose a new optimization objective that explicitly embeds the frequency- and level-dependencies of the auditory model. Our results show that this new optimization objective significantly improves the emulation performance of deep neural networks across relevant input sound levels and auditory-model frequency channels, without increasing the computational load during inference. Addressing these limitations is essential for advancing the application of auditory models in signal-processing tasks, ensuring their efficacy in diverse scenarios.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in details
Published: 2024

10. Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks

Author: Leer, Peter, Jensen, Jesper, Carney, Laurel, Tan, Zheng-Hua, Østergaard, Jan, and Bramsløw, Lars
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This article investigates the use of deep neural networks (DNNs) for hearing-loss compensation. Hearing loss is a prevalent issue affecting millions of people worldwide, and conventional hearing aids have limitations in providing satisfactory compensation. DNNs have shown remarkable performance in various auditory tasks, including speech recognition, speaker identification, and music classification. In this study, we propose a DNN-based approach for hearing-loss compensation, which is trained on the outputs of hearing-impaired and normal-hearing DNN-based auditory models in response to speech signals. First, we introduce a framework for emulating auditory models using DNNs, focusing on an auditory-nerve model in the auditory pathway. We propose a linearization of the DNN-based approach, which we use to analyze the DNN-based hearing-loss compensation. Additionally, we develop a simple approach to choose the acoustic center frequencies of the auditory model used for the compensation strategy. Finally, we evaluate the DNN-based hearing-loss compensation strategies using listening tests with hearing-impaired listeners. The results demonstrate that the proposed approach results in feasible hearing-loss compensation strategies. Our proposed approach was shown to provide an increase in speech intelligibility and was found to outperform a conventional approach in terms of perceived speech quality.
Published: 2024

11. Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

Author: Bovbjerg, Holger Severin, Jensen, Jesper, Østergaard, Jan, and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, 68T10, I.2.6
Abstract: In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD. The trained models are systematically evaluated on both clean speech and speech contaminated by various types of noise at different SNR-levels and compared to a purely supervised model. Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions compared to purely supervised learning.
Comment: To be published at ICASSP2024, 14th of April 2024, Seoul, South Korea. Copyright (c) 2023 IEEE. 5 pages, 2 figures, 5 tables
Published: 2023

12. PAC-Bayes Generalisation Bounds for Dynamical Systems Including Stable RNNs

Author: Eringis, Deividas, Leth, John, Tan, Zheng-Hua, Wisniewski, Rafal, and Petreczky, Mihaly
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNN), and the motivation for this work was its application to RNNs. In order to achieve the results, we impose some stability constraints on the allowed models. Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed in terms of conditions on the weights. We assume the processes involved are essentially bounded and the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution, and the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds, the derived bound holds for non-i.i.d. data (time-series) and it does not grow with the number of steps of the RNN.
Comment: Accepted to AAAI2024 conference
Published: 2023

13. Investigating the Design Space of Diffusion Models for Speech Enhancement

Author: Gonzalez, Philippe, Tan, Zheng-Hua, Østergaard, Jan, Jensen, Jesper, Alstrøm, Tommy Sonne, and May, Tobias
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.
Comment: Accepted version
Published: 2023

14. Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

Author: Gonzalez, Philippe, Tan, Zheng-Hua, Østergaard, Jan, Jensen, Jesper, Alstrøm, Tommy Sonne, and May, Tobias
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule or the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not been applied to speech enhancement before. We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves superior performance at a smaller computational cost compared to a sampler commonly used for speech enhancement.
Comment: Accepted to ICASSP 2024
Published: 2023

15. Joint Minimum Processing Beamforming and Near-end Listening Enhancement

Author: Fuglsig, Andreas J., Jensen, Jesper, Tan, Zheng-Hua, Bertelsen, Lars S., Lindof, Jens Christian, and Østergaard, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions, where intelligibility is already at ceiling level. Recently, [1,2] proposed to remedy this with a minimum processing framework that either reduces noise or enhances listening a minimum amount given that a certain intelligibility criterion is still satisfied. Additionally, it has been shown that joint consideration of both environments improves speech enhancement performance. In this paper, we formulate a joint far- and near-end minimum processing framework that improves intelligibility while limiting speech distortions in favorable noise conditions. We provide closed-form solutions to specific boundary scenarios and investigate performance for the general case using numerical optimization. We also show that concatenating existing minimum processing far- and near-end enhancement methods preserves the effects of the initial methods. Results show that the joint optimization can further improve performance compared to the concatenated approach.
Comment: Accepted at IEEE ICASSP 2024 Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) 2024
Published: 2023

16. Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

Author: Yadav, Sarthak, Theodoridis, Sergios, Hansen, Lars Kai, and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows. Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations, along with demonstrating considerably better scaling characteristics. Investigating attention distances and entropies reveals that MW-MAE encoders learn heads with broader local and global attention. Analyzing attention head feature representations through Projection Weighted Canonical Correlation Analysis (PWCCA) shows that attention heads with the same window sizes across the decoder layers of the MW-MAE learn correlated feature representations, which enables each block to independently capture local and global information, leading to a decoupled decoder feature hierarchy. Code for feature extraction and downstream experiments along with pre-trained models will be released publicly.
Published: 2023

17. Speech inpainting: Context-based speech synthesis guided by video

Author: Montesinos, Juan F., Michelsanti, Daniel, Haro, Gloria, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment in a way that it is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
Comment: Accepted in Interspeech23
Published: 2023

18. PAC-Bayesian bounds for learning LTI-ss systems with input from empirical loss

Author: Eringis, Deividas, Leth, John, Tan, Zheng-Hua, Wisniewski, Rafael, and Petreczky, Mihaly
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: In this paper we derive a Probably Approximately Correct (PAC)-Bayesian error bound for linear time-invariant (LTI) stochastic dynamical systems with inputs. Such bounds are widespread in machine learning, and they are useful for characterizing the predictive power of models learned from finitely many data points. In particular, the bound derived in this paper relates future average prediction errors with the prediction error generated by the model on the data used for learning. In turn, this allows us to provide finite-sample error bounds for a wide class of learning/system identification algorithms. Furthermore, as LTI systems are a sub-class of recurrent neural networks (RNNs), these error bounds could be a first step towards PAC-Bayesian bounds for RNNs.
Comment: arXiv admin note: text overlap with arXiv:2212.14838
Published: 2023

19. PAC-Bayesian-Like Error Bound for a Class of Linear Time-Invariant Stochastic State-Space Models

Author: Eringis, Deividas, Leth, John, Tan, Zheng-Hua, Wisniewski, Rafal, and Petreczky, Mihaly
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Dynamical Systems, Mathematics - Statistics Theory
Abstract: In this paper we derive a PAC-Bayesian-Like error bound for a class of stochastic dynamical systems with inputs, namely, for linear time-invariant stochastic state-space models (stochastic LTI systems for short). This class of systems is widely used in control engineering and econometrics, in particular, they represent a special case of recurrent neural networks. In this paper we 1) formalize the learning problem for stochastic LTI systems with inputs, 2) derive a PAC-Bayesian-Like error bound for such systems, 3) discuss various consequences of this error bound.
Published: 2022

20. Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

Author: López-Espejo, Iván, Shekar, Ram C. M. C., Tan, Zheng-Hua, Jensen, Jesper, and Hansen, John H. L.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Sound
Abstract: In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield a certain KWS performance drop, but also a substantial energy consumption reduction, which is key when deploying common always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3x energy consumption reduction.
Published: 2022

21. Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

Author: Xie, Yuying, Arildsen, Thomas, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Leveraging the fact that speaker identity and content vary on different time scales, factorized hierarchical variational autoencoder (FHVAE) uses different latent variables to symbolize these two attributes. Disentanglement of these attributes is carried out by different prior settings of the corresponding latent variables. For the prior of speaker identity variable, FHVAE assumes it is a Gaussian distribution with an utterance-scale varying mean and a fixed variance. By setting a small fixed variance, the training process encourages identity variables within one utterance to gather close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. Therefore, we introduce contrastive learning into the FHVAE framework, to make the speaker identity variables gather when representing the same speaker, while distancing themselves as far as possible from those of other speakers. The model structure has not been changed in this work but only the training process, thus no additional cost is needed during testing. Voice conversion has been chosen as the application in this paper. Latent variable evaluations include speaker verification and identification for the speaker identity variable, and speech recognition for the content variable. Furthermore, assessments of voice conversion performance are on the grounds of fake speech detection experiments. Results show that the proposed method improves both speaker identity and content feature extraction compared to FHVAE, and has better performance than baseline on conversion., Comment: accepted by EUSIPCO 2023
Published: 2022

22. Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise

Author: Nielsen, Christian Heider and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Computer Science - Sound
Abstract: In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification based methods focus on the design of deep learning models while lacking exploration of domain specific features. This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection. Furthermore, the paper analyses the potentials of using speech and non-speech parts separately in detecting adversarial attacks. In the end, considering adverse environments where ASR systems may be deployed, we study the impact of acoustic noise of various types and signal-to-noise ratios. Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, the detection is effective using either speech or non-speech part, and the acoustic noise can largely degrade the detection performance.
Published: 2022

23. Minimum Processing Near-end Listening Enhancement

Author: Fuglsig, Andreas Jonas, Jensen, Jesper, Tan, Zheng-Hua, Bertelsen, Lars Søndergaard, Lindof, Jens Christian, and Østergaard, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment. By pre-processing the speech signal it is possible to improve the speech intelligibility and quality -- this is known as near-end listening enhancement (NLE). Although existing NLE techniques are able to greatly increase intelligibility in harsh noise environments, in favorable noise conditions the intelligibility of speech reaches a ceiling where it cannot be further enhanced. Actually, the focus of existing methods solely on improving the intelligibility causes unnecessary processing of the speech signal and leads to speech distortions and quality degradations. In this paper, we provide a new rationale for NLE, where the target speech is minimally processed in terms of a processing penalty, provided that a certain performance constraint, e.g., intelligibility, is satisfied. We present a closed-form solution for the case where the performance criterion is an intelligibility estimator based on the approximated speech intelligibility index and the processing penalty is the mean-square error between the processed and the clean speech. This produces an NLE method that adapts to changing noise conditions via a simple gain rule by limiting the processing to the minimum necessary to achieve a desired intelligibility, while at the same time focusing on quality in favorable noise situations by minimizing the amount of speech distortions. Through simulation studies, we show the proposed method attains speech quality on par with or better than existing methods in both objective measurements and subjective listening tests, whilst still sustaining objective speech intelligibility performance on par with existing methods.
Published: 2022

24. Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining

Author: Bovbjerg, Holger Severin and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, 68T10, I.2.6
Abstract: Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data. Most SSL methods for speech have primarily been studied for large models, whereas this is not ideal, as compact KWS models are generally required. This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce. We pretrain three compact transformer-based KWS models using Data2Vec, and fine-tune them on a label-deficient setup of the Google Speech Commands data set. It is found that Data2Vec pretraining leads to a significant increase in accuracy, with label-deficient scenarios showing an improvement of 8.22% to 11.18% absolute accuracy., Comment: To be published at ICASSP2023 Workshop on Self-supervision in Audio, Speech and Beyond, 10th of June 2023, Rhodes, Greece. Copyright (c) 2023 IEEE. 5 pages, 3 figures, 3 tables
Published: 2022

25. Adversarial Multi-Task Deep Learning for Noise-Robust Voice Activity Detection with Low Algorithmic Delay

Author: Larsen, Claus Meyer, Koch, Peter, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice Activity Detection (VAD) is an important pre-processing step in a wide variety of speech processing systems. VAD should in a practical application be able to detect speech in both noisy and noise-free environments, while not introducing significant latency. In this work we propose using an adversarial multi-task learning method when training a supervised VAD. The method has been applied to the state-of-the-art VAD Waveform-based Voice Activity Detection. Additionally, the performance of the VAD is investigated under different algorithmic delays, which is an important factor in latency. Introducing adversarial multi-task learning to the model is observed to increase performance in terms of Area Under Curve (AUC), particularly in noisy environments, while the performance is not degraded at higher SNR levels. The adversarial multi-task learning is only applied in the training phase and thus introduces no additional cost in testing. Furthermore, the correlation between performance and algorithmic delays is investigated, and it is observed that the VAD performance degradation is only moderate when lowering the algorithmic delay from 398 ms to 23 ms.
Published: 2022

26. Floor Map Reconstruction Through Radio Sensing and Learning By a Large Intelligent Surface

Author: Vaca-Rubio, Cristian J., Pereira, Roberto, Mestre, Xavier, Gregoratti, David, Tan, Zheng-Hua, de Carvalho, Elisabeth, and Popovski, Petar
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Environmental scene reconstruction is of great interest for autonomous robotic applications, since an accurate representation of the environment is necessary to ensure safe interaction with robots. Equally important, it is also vital to ensure reliable communication between the robot and its controller. Large Intelligent Surface (LIS) is a technology that has been extensively studied due to its communication capabilities. Moreover, due to the number of antenna elements, these surfaces arise as a powerful solution to radio sensing. This paper presents a novel method to translate radio environmental maps obtained at the LIS to floor plans of the indoor environment built of scatterers spread along its area. A Least Squares (LS) based method, U-Net (UN) and conditional Generative Adversarial Networks (cGANs) were leveraged to perform this task. We show that the floor plan can be correctly reconstructed using both local and global measurements.
Published: 2022

27. User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars

Author: Vaca-Rubio, Cristian J., Salami, Dariush, Popovski, Petar, de Carvalho, Elisabeth, Tan, Zheng-Hua, and Sigg, Stephan
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF-sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have been successfully applied to a wide range of applications. In this work, we compare LIS and mmWave radars for localization in real-world and simulated environments. In our experiments, the mmWave radar achieves 0.71 Intersection Over Union (IOU) and 3cm error for bounding boxes, while LIS has 0.56 IOU and 10cm distance error. Although the radar outperforms the LIS in terms of accuracy, LIS features additional applications in communication in addition to sensing scenarios.
Published: 2022

28. Complex Recurrent Variational Autoencoder with Application to Speech Enhancement

Author: Xie, Yuying, Arildsen, Thomas, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: As an extension of variational autoencoder (VAE), complex VAE uses complex Gaussian distributions to model latent variables and data. This work proposes a complex recurrent VAE framework, specifically in which complex-valued recurrent neural network and L1 reconstruction loss are used. Firstly, to account for the temporal property of speech signals, this work introduces complex-valued recurrent neural network in the complex VAE framework. Besides, L1 loss is used as the reconstruction loss in this framework. To exemplify the use of the complex generative model in speech processing, we choose speech enhancement as the specific application in this paper. Experiments are based on the TIMIT dataset. The results show that the proposed method offers improvements on objective metrics in speech intelligibility and signal quality., Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Published: 2022

29. Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective

Author: Xie, Yuying, Arildsen, Thomas, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level and segmental-level features, which represent speaker identity and speech content information, respectively. As a self-supervised objective, autoregressive predictive coding (APC), on the other hand, has been used in extracting meaningful and transferable speech features for multiple downstream tasks. Inspired by the success of these two representation learning methods, this paper proposes to integrate the APC objective into the FHVAE framework aiming at benefiting from the additional self-supervision target. The main proposed method requires neither more training data nor more computational cost at test time, but obtains improved meaningful representations while maintaining disentanglement. The experiments were conducted on the TIMIT dataset. Results demonstrate that FHVAE equipped with the additional self-supervised objective is able to learn features providing superior performance for tasks including speech recognition and speaker recognition. Furthermore, voice conversion, as one application of disentangled representation learning, has been applied and evaluated. The results show performance similar to baseline of the new framework on voice conversion., Comment: Published in: 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)
Published: 2022

30. Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Author: Yu, Fan, Zhang, Shiliang, Guo, Pengcheng, Fu, Yihui, Du, Zhihao, Zheng, Siqi, Huang, Weilong, Xie, Lei, Tan, Zheng-Hua, Wang, DeLiang, Qian, Yanmin, Lee, Kong Aik, Yan, Zhijie, Ma, Bin, Xu, Xin, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups, baselines and summarize the challenge results and major techniques used in the submissions., Comment: Accepted by ICASSP 2022
Published: 2022

31. Extending battery life in CubeSats by charging current control utilizing a long short-term memory network for solar power predictions

Author: Knap, Vaclav, Bonvang, Gustav A.P., Fagerlund, Frederik Rentzø, Krøyer, Sune, Nguyen, Kim, Thorsager, Mathias, and Tan, Zheng-Hua
Published: 2024

32. On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

Author: Sarkar, Achintya kr. and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL) and auto-regressive prediction coding with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid, irrespective of training target. Among the three training targets, TCL performs the best. Among the various loss functions, cross entropy, joint-softmax and focal loss functions outperform the others. Finally, score-level fusion of different systems is also able to reduce the error rates. Experiments are conducted on the RedDots 2016 challenge database for TD-SV using short utterances. For the speaker classifications, the well-known Gaussian mixture model-universal background model (GMM-UBM) and i-vector techniques are used.
Published: 2022

33. Deep Spoken Keyword Spotting: An Overview

Author: López-Espejo, Iván, Tan, Zheng-Hua, Hansen, John, and Jensen, Jesper
Subjects: Computer Science - Sound, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
Published: 2021

34. Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

Author: Fuglsig, Andreas Jonas, Østergaard, Jan, Jensen, Jesper, Bertelsen, Lars Søndergaard, Mariager, Peter, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper considers speech enhancement of signals picked up in one noisy environment which must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires the consideration of the noise sources in both environments jointly. However, the existing optimal mutual information based method requires a complicated system model that includes natural speech variations, and relies on approximations and assumptions of the underlying signal distributions. In this paper, we propose to use a simpler signal model and optimize speech intelligibility based on the Approximated Speech Intelligibility Index (ASII). We derive a closed-form solution to the joint far- and near-end speech enhancement problem that is independent of the marginal distribution of signal coefficients, and that achieves similar performance to existing work. In addition, we do not need to model or optimize for natural speech variations.
Published: 2021

35. Radio Sensing with Large Intelligent Surface for 6G

Author: Vaca-Rubio, Cristian J., Ramirez-Espinosa, Pablo, Kansanen, Kimmo, Tan, Zheng-Hua, and de Carvalho, Elisabeth
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: This paper leverages the potential of Large Intelligent Surface (LIS) for radio sensing in 6G wireless networks. Substantial research has been undertaken on its communication capabilities, but it can also be exploited as a formidable tool for radio sensing. By taking advantage of arbitrary communication signals occurring in the scenario, we apply direct processing to the output signal from the LIS to obtain a radio map that describes the physical presence of passive devices (scatterers, humans) which act as virtual sources due to the communication signal reflections. We then assess the usage of machine learning (k-means clustering), image processing and computer vision (template matching and component labeling) to extract meaningful information from these radio maps. As an exemplary use case, we evaluate this method for both active and passive user detection in an indoor setting. The results show that the presented method has high application potential as we are able to detect around 98% of humans passively and 100% of active users by just using communication signals of commodity devices even in quite unfavorable Signal-to-Noise Ratio (SNR) conditions.
Published: 2021

36. Design of AoI-Aware 5G Uplink Scheduler Using Reinforcement Learning

Author: Wu, Chien-Cheng, Popovski, Petar, Tan, Zheng-Hua, and Stefanovic, Cedomir
Subjects: Computer Science - Networking and Internet Architecture
Abstract: Age of Information (AoI) reflects the time that is elapsed from the generation of a packet by a 5G user equipment (UE) to the reception of the packet by a controller. A design of an AoI-aware radio resource scheduler for UEs via reinforcement learning is proposed in this paper. In this paper, we consider a remote control environment in which a number of UEs are transmitting time-sensitive measurements to a remote controller. We consider the AoI minimization problem and formulate the problem as a trade-off between minimizing the sum of the expected AoI of all UEs and maximizing the throughput of the network. Inspired by the success of machine learning in solving large networking problems at low complexity, we develop a reinforcement learning-based method to solve the formulated problem. We used the state-of-the-art proximal policy optimization algorithm to solve this problem. Our simulation results show that the proposed algorithm outperforms the considered baselines in terms of minimizing the expected AoI while maintaining the network throughput.
Published: 2021

37. Remote Anomaly Detection in Industry 4.0 Using Resource-Constrained Devices

Author: Kalør, Anders E., Michelsanti, Daniel, Chiariotti, Federico, Tan, Zheng-Hua, and Popovski, Petar
Subjects: Computer Science - Information Theory
Abstract: A central use case for the Internet of Things (IoT) is the adoption of sensors to monitor physical processes, such as the environment and industrial manufacturing processes, where they provide data for predictive maintenance, anomaly detection, or similar. The sensor devices are typically resource-constrained in terms of computation and power, and need to rely on cloud or edge computing for data processing. However, the capacity of the wireless link and their power constraints limit the amount of data that can be transmitted to the cloud. While this is not problematic for the monitoring of slowly varying processes such as temperature, it is more problematic for complex signals such as those captured by vibration and acoustic sensors. In this paper, we consider the specific problem of remote anomaly detection based on signals that fall into the latter category over wireless channels with resource-constrained sensors. We study the impact of source coding on the detection accuracy with both an anomaly detector based on Principal Component Analysis (PCA) and one based on an autoencoder. We show that the coded transmission is beneficial when the signal-to-noise ratio (SNR) of the channel is low, while uncoded transmission performs best in the high SNR regime., Comment: Presented at SPAWC 2021
Published: 2021

38. Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems

Author: Eringis, Deividas, Leth, John, Tan, Zheng-Hua, Wisniewski, Rafal, and Petreczky, Mihaly
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Mathematics - Dynamical Systems, Statistics - Machine Learning
Abstract: In this short article, we showcase the derivation of the optimal (minimum error variance) estimator, when one part of the stochastic LTI system output is not measured but is able to be predicted from the measured system outputs. Similar derivations have been done before but not using state-space representation.
Published: 2021

39. Improvement of Noise-Robust Single-Channel Voice Activity Detection with Spatial Pre-processing

Author: Væhrens, Max, Fuglsig, Andreas Jonas, Jacobsen, Anders Post, Rasmussen, Nicolai Almskou, Nissen, Victor Mølbach, Hejslet, Joachim Roland, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Voice activity detection (VAD) remains a challenge in noisy environments. With access to multiple microphones, prior studies have attempted to improve the noise robustness of VAD by creating multi-channel VAD (MVAD) methods. However, MVAD is relatively new compared to single-channel VAD (SVAD), which has been thoroughly developed in the past. It might therefore be advantageous to improve SVAD methods with pre-processing to obtain superior VAD, which is under-explored. This paper improves SVAD through two pre-processing methods, a beamformer and a spatial target speaker detector. The spatial detector sets signal frames to zero when no potential speaker is present within a target direction. The detector may be implemented as a filter, meaning the input signal for the SVAD is filtered according to the detector's output; or it may be implemented as a spatial VAD to be combined with the SVAD output. The evaluation is made on a noisy reverberant speech database, with clean speech from the Aurora 2 database and with white and babble noise. The results show that SVAD algorithms are significantly improved by the presented pre-processing methods, especially the spatial detector, across all signal-to-noise ratios. The SVAD algorithms with pre-processing significantly outperform a baseline MVAD in challenging noise conditions., Comment: Submitted to Interspeech 2021
Published: 2021

40. INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing

Author: Rao, Wei, Fu, Yihui, Hu, Yanxin, Xu, Xin, Jv, Yvkai, Han, Jiangyu, Jiang, Zhongjie, Xie, Lei, Wang, Yannan, Watanabe, Shinji, Tan, Zheng-Hua, Bu, Hui, Yu, Tao, and Shang, Shidong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing. The challenge consists of two separate tasks: 1) Task 1 is multi-channel speech enhancement with single microphone array and focusing on practical application with real-time requirement and 2) Task 2 is multi-channel speech enhancement with multiple distributed microphone arrays, which is a non-real-time track and does not have any constraints so that participants could explore any algorithms to obtain high speech quality. Targeting the real video conferencing room application, the challenge database was recorded from real speakers and all recording facilities were located by following the real setup of conferencing room. In this challenge, we open-sourced the list of open source clean speech and noise datasets, simulation scripts, and a baseline system for participants to develop their own system. The final ranking of the challenge will be decided by the subjective evaluation which is performed using Absolute Category Ratings (ACR) to estimate Mean Opinion Score (MOS), speech MOS (S-MOS), and noise MOS (N-MOS). This paper describes the challenge, tasks, datasets, and subjective evaluation. The baseline system which is a complex ratio mask based neural network and its experimental results are also presented., Comment: 5 pages, submitted to INTERSPEECH 2021
Published: 2021

41. On TasNet for Low-Latency Single-Speaker Speech Enhancement

Author: Kolbæk, Morten, Tan, Zheng-Hua, Jensen, Søren Holdt, and Jensen, Jesper
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, speech processing algorithms have seen tremendous progress primarily due to the deep learning renaissance. This is especially true for speech separation where the time-domain audio separation network (TasNet) has led to significant improvements. However, for the related task of single-speaker speech enhancement, which is of obvious importance, it is yet unknown, if the TasNet architecture is equally successful. In this paper, we show that TasNet improves state-of-the-art also for speech enhancement, and that the largest gains are achieved for modulated noise sources such as speech. Furthermore, we show that TasNet learns an efficient inner-domain representation, where target and noise signal components are highly separable. This is especially true for noise in terms of interfering speech signals, which might explain why TasNet performs so well on the separation task. Additionally, we show that TasNet performs poorly for large frame hops and conjecture that aliasing might be the main cause of this performance drop. Finally, we show that TasNet consistently outperforms a state-of-the-art single-speaker speech enhancement system.
Published: 2021

42. PAC-Bayesian theory for stochastic LTI systems

Author: Eringis, Deividas, Leth, John, Tan, Zheng-Hua, Wisniewski, Rafal, Esfahani, Alireza Fakhrizadeh, and Petreczky, Mihaly
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: In this paper we derive a PAC-Bayesian error bound for autonomous stochastic LTI state-space models. The motivation for deriving such error bounds is that they will allow deriving similar error bounds for more general dynamical systems, including recurrent neural networks. In turn, PAC-Bayesian error bounds are known to be useful for analyzing machine learning algorithms and for deriving new ones.
Published: 2021

43. Data Generation Using Pass-phrase-dependent Deep Auto-encoders for Text-Dependent Speaker Verification

Author: Sarkar, Achintya Kumar, Sahidullah, Md, and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a novel method that trains pass-phrase specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV). Each PP-DNN auto-encoder is trained using the utterances of a particular pass-phrase available in the target enrollment set with two methods: (i) transfer learning and (ii) training from scratch. Next, feature vectors of a given utterance are fed to the PP-DNNs and the output from each PP-DNN at frame-level is considered one new set of generated data. The generated data from each PP-DNN is then used for building a TD-SV system in contrast to the conventional method that considers only the evaluation data available. The proposed approach can be considered as the transformation of data to the pass-phrase specific space using a non-linear transformation learned by each PP-DNN. The method develops several TD-SV systems with the number equal to the number of PP-DNNs separately trained for each pass-phrase for the evaluation. Finally, the scores of the different TD-SV systems are fused for decision making. Experiments are conducted on the RedDots challenge 2016 database for TD-SV using short utterances. Results show that the proposed method improves the performance for both conventional cepstral feature and deep bottleneck feature using both Gaussian mixture model - universal background model (GMM-UBM) and i-vector framework.
Published: 2021

44. Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding

Author: Sarkar, Achintya kr. and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this letter, we propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV), in which a set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision. Next, we explore the bottleneck (BN) feature extracted by training deep neural networks with a self-supervised objective, autoregressive predictive coding (APC), for TD-SV and compare it with the well-studied speaker-discriminant BN feature. The proposed VTL method is then applied to APC and speaker-discriminant BN features. In the end, we combine the VTL perturbation systems trained on MFCC and the two BN features in the score domain. Experiments are performed on the RedDots challenge 2016 database of TD-SV using short utterances with Gaussian mixture model-universal background model and i-vector techniques. Results show the proposed methods significantly outperform the baselines.
Comment: Copyright (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Published: 2020
Full Text: View/download PDF

45. Assessing Wireless Sensing Potential with Large Intelligent Surfaces

Author: Vaca-Rubio, Cristian J., Ramirez-Espinosa, Pablo, Kansanen, Kimmo, Tan, Zheng-Hua, de Carvalho, Elisabeth, and Popovski, Petar
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Sensing capability is one of the most highlighted new features of future 6G wireless networks. This paper addresses the sensing potential of Large Intelligent Surfaces (LIS) in an exemplary Industry 4.0 scenario. Besides the attention received by LIS in terms of communication aspects, it can offer a high-resolution rendering of the propagation environment. This is because, in an indoor setting, it can be placed in proximity to the sensed phenomena, while the high resolution is offered by densely spaced tiny antennas deployed over a large area. By treating an LIS as a radio image of the environment relying on the received signal power, we develop techniques to sense the environment, by leveraging the tools of image processing and machine learning. Once a holographic image is obtained, a Denoising Autoencoder (DAE) network can be used for constructing a super-resolution image leading to sensing advantages not available in traditional sensing systems. Also, we derive a statistical test based on the Generalized Likelihood Ratio (GLRT) as a benchmark for the machine learning solution. We test these methods for a scenario where we need to detect whether an industrial robot deviates from a predefined route. The results show that the LIS-based sensing offers high precision and has a high application potential in indoor industrial environments.
Comment: arXiv admin note: text overlap with arXiv:2006.06563
Published: 2020

46. CC-Loss: Channel Correlation Loss For Image Classification

Author: Song, Zeyu, Chang, Dongliang, Ma, Zhanyu, Li, Xiaoxu, and Tan, Zheng-Hua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The loss function is a key component in deep learning models. A commonly used loss function for classification is the cross entropy loss, which is a simple yet effective application of information theory for classification problems. Based on this loss, many other loss functions have been proposed, e.g., by adding intra-class and inter-class constraints to enhance the discriminative ability of the learned features. However, these loss functions fail to consider the connections between the feature distribution and the model structure. Aiming at addressing this problem, we propose a channel correlation loss (CC-Loss) that is able to constrain the specific relations between classes and channels as well as maintain the intra-class and the inter-class separability. CC-Loss uses a channel attention module to generate channel attention of features for each sample in the training stage. Next, a Euclidean distance matrix is calculated to make the channel attention vectors associated with the same class become identical and to increase the difference between different classes. Finally, we obtain a feature embedding with good intra-class compactness and inter-class separability. Experimental results show that two different backbone models trained with the proposed CC-Loss outperform the state-of-the-art loss functions on three image classification datasets.
Comment: accepted by ICPR2020
Published: 2020

47. Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization

Author: Xie, Jiyang, Ma, Zhanyu, Lei, Jianjun, Zhang, Guoqiang, Xue, Jing-Hao, Tan, Zheng-Hua, and Guo, Jun
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Due to lack of data, overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs). We propose advanced dropout, a model-free methodology, to mitigate overfitting and improve the performance of DNNs. The advanced dropout technique applies a model-free and easily implemented distribution with parametric prior, and adaptively adjusts dropout rate. Specifically, the distribution parameters are optimized by stochastic gradient variational Bayes in order to carry out an end-to-end training. We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets (five small-scale datasets and two large-scale datasets) with various base models. The advanced dropout outperforms all the referred techniques on all the datasets. We further compare the effectiveness ratios and find that advanced dropout achieves the highest one in most cases. Next, we conduct a set of analyses of dropout rate characteristics, including convergence of the adaptive dropout rate, the learned distributions of dropout masks, and a comparison with dropout rate generation without an explicit distribution. In addition, the ability of overfitting prevention is evaluated and confirmed. Finally, we extend the application of the advanced dropout to uncertainty inference, network pruning, text classification, and regression. The proposed advanced dropout is also superior to the corresponding referred methods. Codes are available at https://github.com/PRIS-CV/AdvancedDropout.
Comment: Accepted by IEEE TPAMI, 2021
Published: 2020
Full Text: View/download PDF

48. Audio-Visual Speech Inpainting with Deep Learning

Author: Morrone, Giovanni, Michelsanti, Daniel, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
Comment: Accepted at ICASSP 2021
Published: 2020

49. An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Author: Michelsanti, Daniel, Tan, Zheng-Hua, Zhang, Shi-Xiong, Xu, Yong, Yu, Meng, Yu, Dong, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
Published: 2020

50. UIAI System for Short-Duration Speaker Verification Challenge 2020

Author: Sahidullah, Md, Sarkar, Achintya Kumar, Vestman, Ville, Liu, Xuechen, Serizel, Romain, Kinnunen, Tomi, Tan, Zheng-Hua, and Vincent, Emmanuel
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound
Abstract: In this work, we present the system description of the UIAI entry for the short-duration speaker verification (SdSV) challenge 2020. Our focus is on Task 1 dedicated to text-dependent speaker verification. We investigate different feature extraction and modeling approaches for automatic speaker verification (ASV) and utterance verification (UV). We have also studied different fusion strategies for combining UV and ASV modules. Our primary submission to the challenge is the fusion of seven subsystems which yields a normalized minimum detection cost function (minDCF) of 0.072 and an equal error rate (EER) of 2.14% on the evaluation set. The single system consisting of a pass-phrase identification based model with phone-discriminative bottleneck features gives a normalized minDCF of 0.118 and achieves 19% relative improvement over the state-of-the-art challenge baseline.
Published: 2020
