Author: "Tan, Zheng-Hua" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Tan, Zheng-Hua"' showing total 648 results

51. Data augmentation enhanced speaker enrollment for text-dependent speaker verification

Author: Sarkar, Achintya Kumar, Sarma, Himangshu, Dwivedi, Priyanka, and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Data augmentation is commonly used for generating additional data from the available training data to achieve a robust estimation of the parameters of complex models like the one for speaker verification (SV), especially for under-resourced applications. SV involves training speaker-independent (SI) models and speaker-dependent models where speakers are represented by models derived from an SI model using the training data for the particular speaker during the enrollment phase. While data augmentation for training SI models is well studied, data augmentation for speaker enrollment is rarely explored. In this paper, we propose the use of data augmentation methods for generating extra data to empower speaker enrollment. Each data augmentation method generates a new data set. Two strategies of using the data sets are explored: the first is to train separate systems and fuse them at the score level, and the other is to conduct multi-conditional training. Furthermore, we study the effect of data augmentation under noisy conditions. Experiments are performed on the RedDots challenge 2016 database, and the results validate the effectiveness of the proposed methods.
Published: 2020

52. A Primer on Large Intelligent Surface (LIS) for Wireless Sensing in an Industrial Setting

Author: Vaca-Rubio, Cristian J., Ramirez-Espinosa, Pablo, Williams, Robin Jess, Kansanen, Kimmo, Tan, Zheng-Hua, de Carvalho, Elisabeth, and Popovski, Petar
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: One of the beyond-5G developments that is often highlighted is the integration of wireless communication and radio sensing. This paper addresses the potential of communication-sensing integration of Large Intelligent Surfaces (LIS) in an exemplary Industry 4.0 scenario. Besides the potential for high throughput and efficient multiplexing of wireless links, an LIS can offer a high-resolution rendering of the propagation environment. This is because, in an indoor setting, it can be placed in proximity to the sensed phenomena, while the high resolution is offered by densely spaced tiny antennas deployed over a large area. By treating an LIS as a radio image of the environment, we develop sensing techniques that leverage the usage of computer vision combined with machine learning. We test these methods for a scenario where we need to detect whether an industrial robot deviates from a predefined route. The results show that the LIS-based sensing offers high precision and has a high application potential in indoor industrial environments.
Published: 2020

53. Exploring Filterbank Learning for Keyword Spotting

Author: López-Espejo, Iván, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Despite their great performance over the years, handcrafted speech features are not necessarily optimal for any particular speech application. Consequently, with greater or lesser success, optimal filterbank learning has been studied for different speech processing tasks. In this paper, we fill in a gap by exploring filterbank learning for keyword spotting (KWS). Two approaches are examined: filterbank matrix learning in the power spectral domain and parameter learning of a psychoacoustically-motivated gammachirp filterbank. Filterbank parameters are optimized jointly with a modern deep residual neural network-based KWS back-end. Our experimental results reveal that, in general, there are no statistically significant differences, in terms of KWS accuracy, between using a learned filterbank and handcrafted speech features. Thus, while we conclude that the latter are still a wise choice when using modern KWS back-ends, we also hypothesize that this could be a symptom of information redundancy, which opens up new research possibilities in the field of small-footprint KWS.
Published: 2020

54. On Bottleneck Features for Text-Dependent Speaker Verification Using X-vectors

Author: Sarkar, Achintya Kumar and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Applying x-vectors for speaker verification has recently attracted great interest, with the focus being on text-independent speaker verification. In this paper, we study x-vectors for text-dependent speaker verification (TD-SV), which remains unexplored. We further investigate the impact of the different bottleneck (BN) features on the performance of x-vectors, including the recently-introduced time-contrastive-learning (TCL) BN features and phone-discriminant BN features. TCL is a weakly supervised learning approach that constructs training data by uniformly partitioning each utterance into a predefined number of segments and then assigning each segment a class label depending on their position in the utterance. We also compare TD-SV performance for different modeling techniques, including the Gaussian mixture models-universal background model (GMM-UBM), i-vector, and x-vector. Experiments are conducted on the RedDots 2016 challenge database. It is found that the type of features has a marginal impact on the performance of x-vectors with the TCL BN feature achieving the lowest equal error rate, while the impact of features is significant for i-vector and GMM-UBM. The fusion of x-vector and i-vector systems gives a large gain in performance. The GMM-UBM technique shows its advantage for TD-SV using short utterances.
Published: 2020

55. OSLNet: Deep Small-Sample Classification with an Orthogonal Softmax Layer

Author: Li, Xiaoxu, Chang, Dongliang, Ma, Zhanyu, Tan, Zheng-Hua, Xue, Jing-Hao, Cao, Jie, Yu, Jingyi, and Guo, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A deep neural network of multiple nonlinear layers forms a large function space, which can easily lead to overfitting when it encounters small-sample data. To mitigate overfitting in small-sample classification, learning more discriminative features from small-sample data is becoming a new trend. To this end, this paper aims to find a subspace of neural networks that can facilitate a large decision margin. Specifically, we propose the Orthogonal Softmax Layer (OSL), which makes the weight vectors in the classification layer remain orthogonal during both the training and test processes. The Rademacher complexity of a network using the OSL is only $\frac{1}{K}$, where $K$ is the number of classes, of that of a network using the fully connected classification layer, leading to a tighter generalization error bound. Experimental results demonstrate that the proposed OSL outperforms the methods used for comparison on four small-sample benchmark datasets and is also applicable to large-sample datasets. Code is available at https://github.com/dongliangchang/OSLNet. Comment: TIP 2020.
Published: 2020
Full Text: View/download PDF
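Illustrative note on record 55: the abstract states only that the classification-layer weight vectors stay orthogonal during training and testing. One simple way to guarantee this, sketched below, is to multiply the weights by a fixed block mask so that each class only sees its own disjoint slice of the feature vector. The mask construction and the names `block_mask` and `masked_weights` are assumptions made for this sketch, not details taken from the record; the authors' repository is the place to check the actual method.

```python
import random


def block_mask(num_classes, feature_dim):
    """Fixed 0/1 mask: class k is connected only to feature block k (assumed design)."""
    if feature_dim % num_classes != 0:
        raise ValueError("feature_dim must be divisible by num_classes")
    block = feature_dim // num_classes
    return [
        [1.0 if k * block <= j < (k + 1) * block else 0.0 for j in range(feature_dim)]
        for k in range(num_classes)
    ]


def masked_weights(weights, mask):
    """Element-wise product of the trainable weights with the fixed mask."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical sizes: 3 classes, 6-dimensional features.
random.seed(0)
raw = [[random.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(3)]
W = masked_weights(raw, block_mask(3, 6))
```

Because the rows of `W` have disjoint support, every pair of class weight vectors has a zero inner product no matter how the unmasked entries are updated.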

56. Vocoder-Based Speech Synthesis from Silent Videos

Author: Michelsanti, Daniel, Slizovskaia, Olga, Haro, Gloria, Gómez, Emilia, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence results in extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches. Comment: Accepted to Interspeech 2020
Published: 2020

57. Relaxed N-Pairs Loss for Context-Aware Recommendations of Television Content

Author: Kristoffersen, Miklas S., Shepstone, Sven E., and Tan, Zheng-Hua
Subjects: Computer Science - Information Retrieval
Abstract: This paper studies context-aware recommendations in the television domain by proposing a deep learning-based method for learning joint context-content embeddings (JCCE). The method builds on recent developments within recommendations using latent representations and deep metric learning, in order to effectively represent contextual settings of viewing situations as well as available content in a shared latent space. This embedding space is used for exploring relevant content in various viewing settings by applying an N-pairs loss objective as well as a relaxed variant proposed in this paper. Experiments confirm the recommendation ability of JCCE, achieving improvements when compared to state-of-the-art methods. Further experiments display useful structures in the learned embeddings that can be used for gaining valuable knowledge of underlying variables in the relationship between contextual settings and content properties.
Published: 2020

58. Adversarial Example Detection by Classification for Deep Speech Recognition

Author: Samizade, Saeid, Tan, Zheng-Hua, Shen, Chao, and Guan, Xiaohong
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: Machine learning systems are vulnerable to adversarial attacks and are highly likely to produce incorrect outputs under these attacks. Attacks are classed as white-box or black-box depending on the adversary's level of access to the victim learning algorithm. To defend the learning systems from these attacks, existing methods in the speech domain focus on modifying input signals and testing the behaviours of speech recognizers. We, however, formulate the defense as a classification problem and present a strategy for systematically generating adversarial example datasets: one for white-box attacks and one for black-box attacks, containing both adversarial and normal examples. The white-box attack is a gradient-based method on Baidu DeepSpeech with the Mozilla Common Voice database while the black-box attack is a gradient-free method on a deep model-based keyword spotting system with the Google Speech Command dataset. The generated datasets are used to train a proposed Convolutional Neural Network (CNN), together with cepstral features, to detect adversarial examples. Experimental results show that it is possible to accurately distinguish between adversarial and normal examples for known attacks, in both single-condition and multi-condition training settings, while the performance degrades dramatically for unknown attacks. The adversarial datasets and the source code are made publicly available.
Published: 2019

59. Deep Joint Embeddings of Context and Content for Recommendation

Author: Kristoffersen, Miklas S., Wieland, Jacob L., Shepstone, Sven E., Tan, Zheng-Hua, and Vinayagamoorthy, Vinoba
Subjects: Computer Science - Information Retrieval
Abstract: This paper proposes a deep learning-based method for learning joint context-content embeddings (JCCE) with a view to context-aware recommendations, and demonstrates its application in the television domain. JCCE builds on recent progress within latent representations for recommendation and deep metric learning. The model effectively groups viewing situations and associated consumed content, based on supervision from 2.7 million viewing events. Experiments confirm the recommendation ability of JCCE, achieving improvements when compared to state-of-the-art methods. Furthermore, the approach shows meaningful structures in the learned representations that can be used to gain valuable insights into underlying factors in the relationship between contextual settings and content properties. Comment: Accepted for CARS 2.0 - Context-Aware Recommender Systems Workshop @ RecSys'19
Published: 2019

60. On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

Author: Kolbæk, Morten, Tan, Zheng-Hua, Jensen, Søren Holdt, and Jensen, Jesper
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous if the receiver is the human auditory system. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution as they can fail completely in certain situations. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems. Comment: Published in the IEEE Transactions on Audio, Speech and Language Processing
Published: 2019
Full Text: View/download PDF
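Illustrative note on record 60: the abstract recommends a loss based on the scale-invariant signal-to-distortion ratio (SI-SDR) but does not spell it out. The sketch below uses the commonly cited definition, in which the target is rescaled by the projection coefficient alpha = <estimate, target> / ||target||^2 before signal and residual energies are compared in dB. The function names and the small `eps` stabiliser are assumptions made for this example; this is not the authors' implementation.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length waveforms (lists of floats)."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    dot_et = sum(e * t for e, t in zip(estimate, target))
    energy_t = sum(t * t for t in target)
    # Projection coefficient that best aligns the target with the estimate.
    alpha = dot_et / (energy_t + eps)
    signal = sum((alpha * t) ** 2 for t in target)
    residual = sum((e - alpha * t) ** 2 for e, t in zip(estimate, target))
    return 10.0 * math.log10((signal + eps) / (residual + eps))


def si_sdr_loss(estimate, target):
    """Negated SI-SDR, so that minimising the loss maximises SI-SDR."""
    return -si_sdr(estimate, target)


# Toy signals (hypothetical): a sinusoid with a small and a large disturbance.
clean = [math.sin(0.1 * n) for n in range(200)]
mild = [c + 0.1 * math.cos(1.7 * n) for n, c in enumerate(clean)]
heavy = [c + 0.5 * math.cos(1.7 * n) for n, c in enumerate(clean)]
```

Rescaling the estimate by any non-zero gain leaves the value essentially unchanged, which is the property that distinguishes SI-SDR from a plain waveform MSE.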

61. Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Author: López-Espejo, Iván, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Sound, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of small electronic devices that allow interaction with them via speech. Often, KWS systems are speaker-independent, which means that any person --user or not-- might trigger them. For applications like KWS for hearing assistive devices this is unacceptable, as only the user must be allowed to handle them. In this paper we propose KWS for hearing assistive devices that is robust to external speakers. A state-of-the-art deep residual network for small-footprint KWS is regarded as a basis to build upon. By following a multi-task learning scheme, this system is extended to jointly perform KWS and users' own-voice/external speaker detection with a negligible increase in the number of parameters. For experiments, we generate from the Google Speech Commands Dataset a speech corpus emulating hearing aids as a capturing device. Our results show that this multi-task deep residual network is able to achieve a KWS accuracy relative improvement of around 32% with respect to a system that does not deal with external speakers.
Published: 2019

62. rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method

Author: Tan, Zheng-Hua, Sarkar, Achintya kr., and Dehak, Najim
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and if no pitch is detected within a segment, the segment is considered as a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with a number of existing methods. In addition, we present a modified version of rVAD where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices. The source code of rVAD is made publicly available.
Published: 2019

63. Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

Author: Michelsanti, Daniel, Tan, Zheng-Hua, Sigurdsson, Sigurdur, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of -5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.
Published: 2019
Full Text: View/download PDF

64. Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

Author: Sarkar, Achintya kr., Tan, Zheng-Hua, Tang, Hao, Shon, Suwon, and Glass, James
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There are a number of studies about extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classification of brain states. Speech signals have a similar non-stationarity property, and TCL further has the advantage of having no need for labeled data. We therefore present a TCL based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (BN) feature with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is on par with those of ASR derived BN features. Moreover,.... Comment: Copyright (c) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Published: 2019
Full Text: View/download PDF
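Illustrative note on record 64: the abstract describes the TCL labelling step as uniformly partitioning each utterance into a predefined number of multi-frame segments, with the segment position serving as a class label shared across utterances. A minimal sketch of that step follows. The function name `tcl_frame_labels` and the way leftover frames are distributed when the utterance length is not a multiple of the segment count are assumptions made for this example, not details from the record.

```python
def tcl_frame_labels(num_frames, num_segments):
    """Assign each frame the index of the uniform segment it falls into.

    Frame i gets label floor(i * num_segments / num_frames), so the labels
    run from 0 to num_segments - 1 and depend only on relative position.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    if num_frames < num_segments:
        raise ValueError("utterance is shorter than the number of segments")
    return [i * num_segments // num_frames for i in range(num_frames)]


# Two hypothetical utterances of different lengths share the same label set.
labels_short = tcl_frame_labels(10, 5)
labels_long = tcl_frame_labels(7, 3)
```

Since the labels come from position alone, no manual transcription or speaker annotation is needed to build the DNN training targets, which matches the weakly supervised framing in the abstract.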

65. On the deficiency of intelligibility metrics as proxies for subjective intelligibility

Author: López-Espejo, Iván, Edraki, Amin, Chan, Wai-Yip, Tan, Zheng-Hua, and Jensen, Jesper
Published: 2023
Full Text: View/download PDF

66. Subjective Annotations for Vision-Based Attention Level Estimation

Author: Coifman, Andrea, Rohoska, Péter, Kristoffersen, Miklas S., Shepstone, Sven E., and Tan, Zheng-Hua
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Attention level estimation systems have a high potential in many use cases, such as human-robot interaction, driver modeling and smart home systems, since being able to measure a person's attention level opens up the possibility of natural interaction between humans and computers. The topic of estimating a human's visual focus of attention has been actively addressed recently in the field of HCI. However, most of these previous works do not consider attention as a subjective, cognitive attentive state. New research within the field also faces the problem of the lack of annotated datasets regarding attention level in a certain context. The novelty of our work is two-fold: first, we introduce a new annotation framework that tackles the subjective nature of attention level and use it to annotate more than 100,000 images with three attention levels; second, we introduce a novel method to estimate attention levels, relying purely on extracted geometric features from RGB and depth images, and evaluate it with a deep learning fusion framework. The system achieves an overall accuracy of 80.02%. Our framework and attention level annotations are made publicly available. Comment: 14th International Conference on Computer Vision Theory and Applications
Published: 2018

67. Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems

Author: Michelsanti, Daniel, Tan, Zheng-Hua, Sigurdsson, Sigurdur, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as Lombard effect. Current speech enhancement systems based on deep learning do not usually take into account this change in the speaking style, because they are trained with neutral (non-Lombard) speech utterances recorded under quiet conditions to which noise is artificially added. In this paper, we investigate the effects that the Lombard reflex has on the performance of audio-visual speech enhancement systems based on deep learning. The results show that a gap in the performance of as much as approximately 5 dB between the systems trained on neutral speech and the ones trained on Lombard speech exists. This indicates the benefit of taking into account the mismatch between neutral and Lombard speech in the design of audio-visual speech enhancement systems.
Published: 2018
Full Text: View/download PDF

68. On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Author: Michelsanti, Daniel, Tan, Zheng-Hua, Sigurdsson, Sigurdur, and Jensen, Jesper
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate, to be used for training is critical for the performance. This work is the first that presents an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.
Published: 2018
Full Text: View/download PDF

69. Single-Crystalline SrTiO3 as Memristive Model System: From Materials Science to Neurological and Psychological Functions

Author: Yin, Xue-Bing, Tan, Zheng-Hua, Yang, Rui, Guo, Xin, Tuller, Harry L., Series Editor, Rupp, Jennifer, editor, Ielmini, Daniele, editor, and Valov, Ilia, editor
Published: 2022
Full Text: View/download PDF

70. The Importance of Context When Recommending TV Content: Dataset and Algorithms

Author: Kristoffersen, Miklas S., Shepstone, Sven E., and Tan, Zheng-Hua
Subjects: Computer Science - Information Retrieval, Computer Science - Machine Learning, Computer Science - Multimedia, Statistics - Machine Learning
Abstract: Home entertainment systems feature in a variety of usage scenarios with one or more simultaneous users, for whom the complexity of choosing media to consume has increased rapidly over the last decade. Users' decision processes are complex and highly influenced by contextual settings, but data supporting the development and evaluation of context-aware recommender systems are scarce. In this paper we present a dataset of self-reported TV consumption enriched with contextual information of viewing situations. We show how choice of genre associates with, among others, the number of present users and users' attention levels. Furthermore, we evaluate the performance of predicting chosen genres given different configurations of contextual information, and compare the results to contextless predictions. The results suggest that including contextual features in the prediction causes notable improvements, and both temporal and social context show significant contributions.
Published: 2018
Full Text: View/download PDF

71. On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

Author: Kolbæk, Morten, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g. speech intelligibility. Short-Time Objective Intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question if a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal, if the objective is to enhance noisy speech in a manner which is optimal with respect to the STOI speech intelligibility estimator.
Published: 2018
Full Text: View/download PDF

72. A Parallel/Distributed Algorithmic Framework for Mining All Quantitative Association Rules

Author: Christou, Ioannis T., Amolochitis, Emmanouil, and Tan, Zheng-Hua
Subjects: Computer Science - Artificial Intelligence, Computer Science - Databases
Abstract: We present QARMA, an efficient novel parallel algorithm for mining all Quantitative Association Rules in large multidimensional datasets where items are required to have at least a single common attribute to be specified in the rule's single consequent item. Given a minimum support level and a set of threshold criteria of interestingness measures such as confidence, conviction, etc., our algorithm guarantees the generation of all non-dominated Quantitative Association Rules that meet the minimum support and interestingness requirements. Such rules can be of great importance to marketing departments seeking to optimize targeted campaigns, or general market segmentation. They can also be of value in medical applications, financial as well as predictive maintenance domains. We provide computational results showing the scalability of our algorithm, and its capability to produce all rules to be found in large scale synthetic and real world datasets such as MovieLens, within a few seconds or minutes of computational time on commodity hardware. Comment: 14 pages, 2 figures
Published: 2018

73. Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure

Author: Kolbæk, Morten, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we propose a Deep Neural Network (DNN) based Speech Enhancement (SE) system that is designed to maximize an approximation of the Short-Time Objective Intelligibility (STOI) measure. We formalize an approximate-STOI cost function and derive analytical expressions for the gradients required for DNN training and show that these gradients have desirable properties when used together with gradient based optimization techniques. We show through simulation experiments that the proposed SE system achieves large improvements in estimated speech intelligibility, when tested on matched and unmatched natural noise types, at multiple signal-to-noise ratios. Furthermore, we show that the SE system, when trained using an approximate-STOI cost function performs on par with a system trained with a mean square error cost applied to short-time temporal envelopes. Finally, we show that the proposed SE system performs on par with a traditional DNN based Short-Time Spectral Amplitude (STSA) SE system in terms of estimated speech intelligibility. These results are important because they suggest that traditional DNN based STSA SE systems might be optimal in terms of estimated speech intelligibility.
Comment: To appear in ICASSP 2018
Published: 2018

74. Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

Author: Michelsanti, Daniel and Tan, Zheng-Hua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing, Statistics - Machine Learning
Abstract: Improving speech system performance in noisy environments remains a challenging task, and speech enhancement (SE) is one of the effective techniques to solve the problem. Motivated by the promising results of generative adversarial networks (GANs) in a variety of image processing tasks, we explore the potential of conditional GANs (cGANs) for SE, and in particular, we make use of the image processing framework proposed by Isola et al. [1] to learn a mapping from the spectrogram of noisy speech to an enhanced counterpart. The SE cGAN consists of two networks, trained in an adversarial manner: a generator that tries to enhance the input noisy spectrogram, and a discriminator that tries to distinguish between enhanced spectrograms provided by the generator and clean ones from the database using the noisy spectrogram as a condition. We evaluate the performance of the cGAN method in terms of perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and equal error rate (EER) of speaker verification (an example application). Experimental results show that the cGAN method overall outperforms the classical short-time spectral amplitude minimum mean square error (STSA-MMSE) SE algorithm, and is comparable to a deep neural network-based SE approach (DNN-SE).
Comment: INTERSPEECH 2017, August 20-24, 2017, Stockholm, Sweden
Published: 2017
Full Text: View/download PDF

75. Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training

Author: Kolbæk, Morten, Yu, Dong, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on generalizability and noise robustness of models that rely on various types of a priori knowledge, e.g. in terms of noise type and number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements, when evaluated using known noise types, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures, without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.
Comment: To appear in MLSP 2017
Published: 2017

76. Adversarial Network Bottleneck Features for Noise Robust Speaker Verification

Author: Yu, Hong, Tan, Zheng-Hua, Ma, Zhanyu, and Guo, Jun
Subjects: Computer Science - Sound
Abstract: In this paper, we propose a noise robust bottleneck feature representation which is generated by an adversarial network (AN). The AN includes two cascade connected networks, an encoding network (EN) and a discriminative network (DN). Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used as input to the EN and the output of the EN is used as the noise robust feature. The EN and DN are trained in turn, namely, when training the DN, noise types are selected as the training labels and when training the EN, all labels are set as the same, i.e., the clean speech label, which aims to make the AN features invariant to noise and thus achieve noise robustness. We evaluate the performance of the proposed feature on a Gaussian Mixture Model-Universal Background Model based speaker verification system, and make comparison to MFCC features of speech enhanced by short-time spectral amplitude minimum mean square error (STSA-MMSE) and deep neural network-based speech enhancement (DNN-SE) methods. Experimental results on the RSR2015 database show that the proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE and DNN-SE based MFCCs for different noise types and signal-to-noise ratios. Furthermore, the AN-BN feature is able to improve the speaker verification performance under the clean condition.
Published: 2017

77. Decorrelation of Neutral Vector Variables: Theory and Applications

Author: Ma, Zhanyu, Xue, Jing-Hao, Leijon, Arne, Tan, Zheng-Hua, Yang, Zhen, and Guo, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: In this paper, we propose novel strategies for neutral vector variable decorrelation. Two fundamental invertible transformations, namely serial nonlinear transformation and parallel nonlinear transformation, are proposed to carry out the decorrelation. For a neutral vector variable, which is not multivariate Gaussian distributed, the conventional principal component analysis (PCA) cannot yield mutually independent scalar variables. With the two proposed transformations, a highly negatively correlated neutral vector can be transformed to a set of mutually independent scalar variables with the same degrees of freedom. We also evaluate the decorrelation performances for the vectors generated from a single Dirichlet distribution and a mixture of Dirichlet distributions. The mutual independence is verified with the distance correlation measurement. The advantages of the proposed decorrelation strategies are intensively studied and demonstrated with synthesized data and practical application evaluations.
Published: 2017

78. Time-Contrastive Learning Based DNN Bottleneck Features for Text-Dependent Speaker Verification

Author: Sarkar, Achintya Kr. and Tan, Zheng-Hua
Subjects: Computer Science - Sound, Computer Science - Machine Learning
Abstract: In this paper, we present a time-contrastive learning (TCL) based bottleneck (BN) feature extraction method for speech signals with an application to text-dependent (TD) speaker verification (SV). It is well-known that speech signals exhibit quasi-stationary behavior in and only in a short interval, and the TCL method aims to exploit this temporal structure. More specifically, it trains deep neural networks (DNNs) to discriminate temporal events obtained by uniformly segmenting speech signals, in contrast to existing DNN based BN feature extraction methods that train DNNs using labeled data to discriminate speakers or pass-phrases or phones or a combination of them. In the context of speaker verification, speech data of fixed pass-phrases are used for TCL-BN training, while the pass-phrases used for TCL-BN training are excluded from being used for SV, so that the learned features can be considered generic. The method is evaluated on the RedDots Challenge 2016 database. Experimental results show that TCL-BN is superior to the existing speaker and pass-phrase discriminant BN features and the Mel-frequency cepstral coefficient feature for text-dependent speaker verification.
Published: 2017

79. Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks

Author: Kolbæk, Morten, Yu, Dong, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we propose the utterance-level Permutation Invariant Training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning based solution for speaker independent multi-talker speech separation. Specifically, uPIT extends the recently proposed Permutation Invariant Training (PIT) technique with an utterance-level cost function, hence eliminating the need for solving an additional permutation problem during inference, which is otherwise required by frame-level PIT. We achieve this using Recurrent Neural Networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream. In practice, this allows RNNs, trained with uPIT, to separate multi-talker mixed speech without any prior knowledge of signal duration, number of speakers, speaker identity or gender. We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on Non-negative Matrix Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network (DANet). Furthermore, we found that models trained with uPIT generalize well to unseen speakers and languages. Finally, we found that a single model, trained with uPIT, can handle both two-speaker, and three-speaker speech mixtures.
Published: 2017
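
The utterance-level criterion described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `frame_mse` and `upit_loss`, the use of plain mean squared error as the separation error, and the list-of-frames data layout are all assumptions made for illustration. The point it shows is that one output-to-speaker permutation is chosen for the whole utterance, rather than per frame.

```python
from itertools import permutations


def frame_mse(est_frame, tgt_frame):
    """Mean squared error between two equal-length feature frames."""
    return sum((e - t) ** 2 for e, t in zip(est_frame, tgt_frame)) / len(est_frame)


def upit_loss(estimates, targets):
    """Utterance-level permutation invariant loss (illustrative sketch).

    estimates, targets: lists of S streams; each stream is a list of T frames;
    each frame is a list of floats. Returns (loss, perm), where perm[i] is the
    target index assigned to output stream i for the entire utterance.
    """
    num_streams = len(targets)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_streams)):
        total, count = 0.0, 0
        for out_idx, tgt_idx in enumerate(perm):
            # The same assignment is applied to every frame of the utterance.
            for est_frame, tgt_frame in zip(estimates[out_idx], targets[tgt_idx]):
                total += frame_mse(est_frame, tgt_frame)
                count += 1
        loss = total / count
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the assignment is fixed across all frames, each output stream follows a single speaker, which is the property the abstract credits with removing the extra permutation step at inference time. The exhaustive search over permutations grows factorially with the number of streams, which is acceptable for the two- and three-talker cases mentioned.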

80. DNN Filter Bank Cepstral Coefficients for Spoofing Detection

Author: Yu, Hong, Tan, Zheng-Hua, Ma, Zhanyu, and Guo, Jun
Subjects: Computer Science - Sound, Computer Science - Cryptography and Security, Computer Science - Learning
Abstract: With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attack. In order to improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech. The deep neural network filter bank is automatically generated by training a filter bank neural network (FBNN) using natural and synthetic speech. By adding restrictions on the training rules, the learned weight matrix of FBNN is band-limited and sorted by frequency, similar to the normal filter bank. Unlike the manually designed filter bank, the learned filter bank has different filter shapes in different channels, which can capture the differences between natural and synthetic speech more effectively. The experimental results on the ASVspoof 2015 database show that the Gaussian mixture model maximum-likelihood (GMM-ML) classifier trained by the new feature performs better than the state-of-the-art linear frequency cepstral coefficients (LFCC) based classifier, especially on detecting unknown attacks.
Published: 2017

81. Incorporating Pass-Phrase Dependent Background Models for Text-Dependent Speaker Verification

Author: Sarkar, A. K. and Tan, Zheng-Hua
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we propose pass-phrase dependent background models (PBMs) for text-dependent (TD) speaker verification (SV) to integrate the pass-phrase identification process into the conventional TD-SV system, where a PBM is derived from a text-independent background model through adaptation using the utterances of a particular pass-phrase. During training, pass-phrase specific target speaker models are derived from the particular PBM using the training data for the respective target model. While testing, the best PBM is first selected for the test utterance in the maximum likelihood (ML) sense and the selected PBM is then used for the log likelihood ratio (LLR) calculation with respect to the claimant model. The proposed method incorporates the pass-phrase identification step in the LLR calculation, which is not considered in conventional standalone TD-SV systems. The performance of the proposed method is compared to conventional text-independent background model based TD-SV systems using either Gaussian mixture model (GMM)-universal background model (UBM) or Hidden Markov model (HMM)-UBM or i-vector paradigms. In addition, we consider two approaches to build PBMs: speaker-independent and speaker-dependent. We show that the proposed method significantly reduces the error rates of text-dependent speaker verification for the non-target types: target-wrong and imposter-wrong while it maintains comparable TD-SV performance when imposters speak a correct utterance with respect to the conventional system. Experiments are conducted on the RedDots challenge and the RSR2015 databases that consist of short utterances.
Published: 2016
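
The test-time procedure in the abstract above (select the best PBM in the maximum likelihood sense, then score the claimant against it) can be sketched as follows. This is an illustrative assumption, not the authors' code: the function name `pbm_llr` and the dictionary of precomputed log-likelihoods are hypothetical, and computing those log-likelihoods from GMM or HMM models is outside the sketch.

```python
def pbm_llr(pbm_loglik, claimant_loglik):
    """Select the best pass-phrase background model and compute the LLR.

    pbm_loglik: dict mapping a pass-phrase id to the log-likelihood of the
        test utterance under that pass-phrase dependent background model.
    claimant_loglik: log-likelihood of the test utterance under the
        claimant (target speaker) model.
    Returns (selected_phrase, llr).
    """
    # Maximum likelihood selection of the background model.
    selected_phrase = max(pbm_loglik, key=pbm_loglik.get)
    # Log likelihood ratio of the claimant model against the selected PBM.
    llr = claimant_loglik - pbm_loglik[selected_phrase]
    return selected_phrase, llr


# Hypothetical scores for one test utterance.
phrase, score = pbm_llr({"phrase_a": -120.0, "phrase_b": -95.5}, -90.0)
```

Since the background model is chosen per test utterance, an utterance of the wrong pass-phrase is scored against the PBM that fits it best, which lowers its LLR relative to a text-independent background model.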

82. A Primer on Large Intelligent Surface (LIS) for Wireless Sensing in an Industrial Setting

Author: Vaca-Rubio, Cristian J., Ramirez-Espinosa, Pablo, Jess Williams, Robin, Kansanen, Kimmo, Tan, Zheng-Hua, de Carvalho, Elisabeth, Popovski, Petar, Caso, Giuseppe, editor, De Nardis, Luca, editor, and Gavrilovska, Liljana, editor
Published: 2021
Full Text: View/download PDF

83. Self-Supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

Author: Bovbjerg, Holger Severin, primary, Jensen, Jesper, additional, Østergaard, Jan, additional, and Tan, Zheng-Hua, additional
Published: 2024
Full Text: View/download PDF

84. Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

Author: Gonzalez, Philippe, primary, Tan, Zheng-Hua, additional, Østergaard, Jan, additional, Jensen, Jesper, additional, Alstrøm, Tommy Sonne, additional, and May, Tobias, additional
Published: 2024
Full Text: View/download PDF

85. PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs

Author: Eringis, Deividas, primary, Leth, John, additional, Tan, Zheng-Hua, additional, Wisniewski, Rafael, additional, and Petreczky, Mihály, additional
Published: 2024
Full Text: View/download PDF

86. Self-segmentation of pass-phrase utterances for deep feature learning in text-dependent speaker verification

Author: Sarkar, Achintya Kumar and Tan, Zheng-Hua
Published: 2021
Full Text: View/download PDF

87. Deep InterBoost networks for small-sample image classification

Author: Li, Xiaoxu, Chang, Dongliang, Ma, Zhanyu, Tan, Zheng-Hua, Xue, Jing-Hao, Cao, Jie, and Guo, Jun
Published: 2021
Full Text: View/download PDF

88. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation

Author: Yu, Dong, Kolbæk, Morten, Tan, Zheng-Hua, and Jensen, Jesper
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a novel deep learning model, which supports permutation invariant training (PIT), for speaker independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from most of the prior arts that treat speech separation as a multi-class regression problem and the deep clustering technique that considers it a segmentation (or clustering) problem, our model optimizes for the separation regression error, ignoring the order of mixing sources. This strategy cleverly solves the long-lasting label permutation problem that has prevented progress on deep learning based techniques for speech separation. Experiments on the equal-energy mixing setup of a Danish corpus confirm the effectiveness of PIT. We believe improvements built upon PIT can eventually solve the cocktail-party problem and enable real-world adoption of, e.g., automatic meeting transcription and multi-party human-computer interaction, where overlapping speech is common.
Comment: 5 pages
Published: 2016

89. Single-Crystalline SrTiO3 as Memristive Model System: From Materials Science to Neurological and Psychological Functions

Author: Yin, Xue-Bing, primary, Tan, Zheng-Hua, additional, Yang, Rui, additional, and Guo, Xin, additional
Published: 2021
Full Text: View/download PDF

90. rVAD: An unsupervised segment-based robust voice activity detection method

Author: Tan, Zheng-Hua, Sarkar, Achintya Kr., and Dehak, Najim
Published: 2020
Full Text: View/download PDF

91. Generating Accurate and Diverse Audio Captions through Variational Autoencoder Framework

Author: Zhang, Yiming, primary, Du, Ruoyi, additional, Tan, Zheng-Hua, additional, Wang, Wenwu, additional, and Ma, Zhanyu, additional
Published: 2024
Full Text: View/download PDF

92. How to Train Your Ears: Auditory-Model Emulation for Large-Dynamic-Range Inputs and Mild-to-Severe Hearing Losses

Author: Leer, Peter, primary, Jensen, Jesper, additional, Tan, Zheng-Hua, additional, Østergaard, Jan, additional, and Bramsløw, Lars, additional
Published: 2024
Full Text: View/download PDF

93. Data-Driven Non-Intrusive Speech Intelligibility Prediction Using Speech Presence Probability

Author: Pedersen, Mathias Bach, primary, Jensen, Søren Holdt, additional, Tan, Zheng-Hua, additional, and Jensen, Jesper, additional
Published: 2024
Full Text: View/download PDF

94. Data Science Education: The Signal Processing Perspective [SP Education]

Author: Gannot, Sharon, primary, Tan, Zheng-Hua, additional, Haardt, Martin, additional, Chen, Nancy F., additional, Wai, Hoi-To, additional, Tashev, Ivan, additional, Kellermann, Walter, additional, and Dauwels, Justin, additional
Published: 2023
Full Text: View/download PDF

95. Improved Disentangled Speech Representations Using Contrastive Learning in Factorized Hierarchical Variational Autoencoder

Author: Xie, Yuying, primary, Arildsen, Thomas, additional, and Tan, Zheng-Hua, additional
Published: 2023
Full Text: View/download PDF

96. Speech inpainting: Context-based speech synthesis guided by video

Author: Montesinos, Juan Felipe, primary, Michelsanti, Daniel, additional, Haro, Gloria, additional, Tan, Zheng-Hua, additional, and Jensen, Jesper, additional
Published: 2023
Full Text: View/download PDF