9 results for '"Jen-Tzung Chien"'
Search Results
2. Variational Domain Adversarial Learning With Mutual Information Maximization for Speaker Verification
- Author
- Youzhi Tu, Jen-Tzung Chien, and Man-Wai Mak
- Subjects
- Acoustics and Ultrasonics, Artificial neural network, Computer science, Speech recognition, Gaussian, Maximization, Mutual information, Domain (software engineering), Constraint (information theory), Computational Mathematics, Discriminative model, Computer Science (miscellaneous), Electrical and Electronic Engineering, Communication channel
- Abstract
Domain mismatch is a common problem in speaker verification (SV) and often causes performance degradation. For systems that rely on a Gaussian PLDA backend to suppress channel variability, performance is further limited if there is no Gaussianity constraint on the learned embeddings. This paper proposes an information-maximized variational domain adversarial neural network (InfoVDANN) that incorporates an InfoVAE into domain adversarial training (DAT) to reduce domain mismatch and simultaneously meet the Gaussianity requirement of the PLDA backend. Specifically, DAT is applied to produce speaker-discriminative and domain-invariant features, while the InfoVAE performs variational regularization on the embedded features so that they follow a Gaussian distribution. Another benefit of the InfoVAE is that it avoids posterior collapse in VAEs by preserving the mutual information between the embedded features and the training set, so that extra speaker information can be retained in the features. Experiments on both SRE16 and SRE18-CMN2 show that the InfoVDANN outperforms the recent VDANN, which suggests that increasing the mutual information between the embedded features and the input features enables the InfoVDANN to extract extra speaker information that would otherwise be lost.
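As a rough illustration of how these pieces fit together, one plausible form of the InfoVDANN training objective (the notation and the trade-off weights λ, β, α are assumptions for illustration, not taken from the paper) combines a speaker classification loss, an adversarial domain loss, and the InfoVAE regularizer:
```latex
\mathcal{L}_{\text{InfoVDANN}}
  = \mathcal{L}_{\text{spk}}
  - \lambda \, \mathcal{L}_{\text{dom}}
  + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I}) \right)
  - \alpha \, I_q(\mathbf{x}; \mathbf{z})
```
The KL term supplies the Gaussianity constraint needed by the PLDA backend, while the mutual-information term I_q(x; z) is what counteracts posterior collapse.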
- Published
- 2020
- Full Text
- View/download PDF
3. Multisource I-Vectors Domain Adaptation Using Maximum Mean Discrepancy Based Autoencoders
- Author
- Man-Wai Mak, Weiwei Lin, and Jen-Tzung Chien
- Subjects
- Acoustics and Ultrasonics, Computer science, Nonparametric statistics, Pattern recognition, Autoencoder, Computational Mathematics, Robustness (computer science), Computer Science (miscellaneous), Preprocessor, NIST, Probability distribution, Maximum mean discrepancy, Artificial intelligence, Electrical and Electronic Engineering, Test data
- Abstract
Like many machine learning tasks, speaker verification (SV) degrades when the training and test data come from very different distributions. Moreover, the training and test data may themselves be composed of heterogeneous subsets. These multisource mismatches are detrimental to SV performance. This paper proposes incorporating maximum mean discrepancy (MMD) into the loss function of autoencoders to reduce these mismatches. MMD is a nonparametric method for measuring the distance between two probability distributions. With a properly chosen kernel, MMD can match up to infinite moments of data distributions. We generalize MMD to measure the discrepancies of multiple distributions and call the generalized MMD domainwise MMD. Using domainwise MMD as an objective function, we propose two autoencoders, namely the nuisance-attribute autoencoder (NAE) and the domain-invariant autoencoder (DAE), for multisource i-vector adaptation. The NAE encodes the features that cause most of the multisource mismatch measured by domainwise MMD. The DAE directly encodes the features that minimize the multisource mismatch. Using these MMD-based autoencoders as a preprocessing step for PLDA training, we achieve a relative improvement of 19.2% in EER on the NIST 2016 SRE compared to PLDA without adaptation. We also found that MMD-based autoencoders are more robust to unseen domains: in the domain-robustness experiments, they show 6.8% and 5.2% improvements over IDVC on female and male Cantonese speakers, respectively.
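The "domainwise MMD" suggests a natural multi-distribution generalization of the two-sample MMD. A minimal sketch, assuming an RBF kernel and an average of pairwise squared MMDs over all domain pairs (the paper's exact generalization may differ):
```python
import numpy as np
from itertools import combinations

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between the rows of a and b.
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared MMD between two samples.
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

def domainwise_mmd(domains, sigma=1.0):
    # Average pairwise MMD^2 over all domains (assumed form).
    pairs = list(combinations(range(len(domains)), 2))
    return sum(mmd2(domains[i], domains[j], sigma) for i, j in pairs) / len(pairs)
```
In training, such a term would be added to the autoencoder's reconstruction loss so that the encoder learns features whose distributions agree across sources.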
- Published
- 2018
- Full Text
- View/download PDF
4. Bayesian Nonparametric Learning for Hierarchical and Sparse Topics
- Author
- Jen-Tzung Chien
- Subjects
- Hierarchical Dirichlet process, Topic model, Acoustics and Ultrasonics, Computer science, Document classification, Mixture model, Hierarchical database model, Computational Mathematics, Tree (data structure), Computer Science (miscellaneous), Chinese restaurant process, Data mining, Electrical and Electronic Engineering, Decision tree model
- Abstract
This paper presents Bayesian nonparametric (BNP) learning of hierarchical and sparse topics from natural language. Traditionally, the Indian buffet process provides a BNP prior on a binary matrix for an infinite latent feature model consisting of a flat layer of topics. Nested models pave an avenue to construct a tree model instead of a flat-layer model. This paper presents the nested Indian buffet process (nIBP) to achieve sparsity and flexibility in topic modeling, where the model complexity and topic hierarchy are learned from the groups of words. Mixed-membership modeling is conducted by representing a document through the tree nodes, or dishes, that the document, as a customer, chooses according to the nIBP scenario. A tree stick-breaking process is implemented to select topic weights from a subtree for flexible topic modeling. Such an nIBP relaxes the constraint of adopting a single tree path in the nested Chinese restaurant process (nCRP) and therefore improves the variety of topic representations for heterogeneous documents. A Gibbs sampling procedure is developed to infer the nIBP topic model. Compared to the nested hierarchical Dirichlet process (nHDP), the compactness of the topics estimated in a tree by the nIBP is improved. Experimental results show that the proposed nIBP reduces the error rates of the nCRP and nHDP by 18% and 8%, respectively, on the Reuters document classification task.
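For intuition about the prior being nested here, below is a minimal sketch of sampling a document-by-topic binary matrix from the flat Indian buffet process; the nIBP arranges these dish choices along tree paths instead. The function name and `alpha` are illustrative assumptions:
```python
import numpy as np

def sample_ibp(num_docs, alpha=2.0, seed=0):
    # dishes[k] counts how many documents have chosen topic (dish) k.
    rng = np.random.default_rng(seed)
    dishes, rows = [], []
    for n in range(1, num_docs + 1):
        # Document n takes existing dish k with probability m_k / n ...
        row = [rng.random() < count / n for count in dishes]
        for k, taken in enumerate(row):
            dishes[k] += taken
        # ... then samples Poisson(alpha / n) brand-new dishes.
        new = rng.poisson(alpha / n)
        dishes.extend([1] * new)
        rows.append(row + [True] * new)
    width = len(dishes)
    return np.array([r + [False] * (width - len(r)) for r in rows])
```
`sample_ibp(100)` returns a 100-row binary matrix whose number of columns (topics) is itself random, which is the source of the model's flexible complexity.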
- Published
- 2018
- Full Text
- View/download PDF
5. DNN-Driven Mixture of PLDA for Robust Speaker Verification
- Author
- Jen-Tzung Chien, Na Li, and Man-Wai Mak
- Subjects
- Acoustics and Ultrasonics, Artificial neural network, Computer science, Speech recognition, Supervised learning, Posterior probability, Pattern recognition, Mixture model, Marginal likelihood, Computational Mathematics, Discriminative model, Component (UML), Computer Science (miscellaneous), NIST, Artificial intelligence, Electrical and Electronic Engineering
- Abstract
The mismatch between enrollment and test utterances due to different types of variability is a great challenge in speaker verification. Based on the observation that SNR-level or channel-type variability causes heterogeneous clusters in the i-vector space, this paper proposes applying supervised learning to drive or guide the learning of probabilistic linear discriminant analysis (PLDA) mixture models. Specifically, a deep neural network (DNN) is trained to produce the posterior probabilities of different SNR levels or channel types given i-vectors as input. These posteriors then replace the posterior probabilities of the indicator variables in the mixture of PLDA. The discriminative training causes the mixture model to perform a more reasonable soft division of the i-vector space than the conventional mixture of PLDA. During verification, given a test i-vector and a target-speaker i-vector, the marginal likelihood for the same-speaker hypothesis is obtained by summing the component likelihoods weighted by the component posteriors produced by the DNN, and likewise for the different-speaker hypothesis. Results based on the NIST 2012 SRE demonstrate that the proposed scheme leads to better performance under realistic conditions where both training and test utterances cover a wide range of SNRs and different channel types. Unlike the previous SNR-dependent mixture of PLDA, which only focuses on SNR mismatch, the proposed model is more general and is potentially applicable to addressing different types of variability in speech.
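A hedged sketch of the verification scoring described above, in which the DNN's condition posteriors weight the component PLDA likelihoods; `dnn_posteriors` and `plda_loglik` are hypothetical stand-ins, and the paper's exact weighting may differ:
```python
import numpy as np

def mixture_plda_llr(x_test, x_target, dnn_posteriors, plda_loglik):
    # Posteriors over the K conditions (SNR levels or channel types).
    post = dnn_posteriors(x_test)
    # Marginal likelihood for each hypothesis: component likelihoods
    # weighted by the DNN-produced component posteriors.
    same = sum(p * np.exp(plda_loglik(x_test, x_target, k, same_speaker=True))
               for k, p in enumerate(post))
    diff = sum(p * np.exp(plda_loglik(x_test, x_target, k, same_speaker=False))
               for k, p in enumerate(post))
    return np.log(same) - np.log(diff)  # verification score as an LLR
```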
- Published
- 2017
- Full Text
- View/download PDF
6. Mixture of PLDA for Noise Robust I-Vector Speaker Verification
- Author
- Jen-Tzung Chien, Xiaomin Pang, and Man-Wai Mak
- Subjects
- Acoustics and Ultrasonics, Noise measurement, Computer science, Speech recognition, Feature extraction, Posterior probability, Pattern recognition, Computational Mathematics, Noise, Signal-to-noise ratio, Expectation–maximization algorithm, Computer Science (miscellaneous), Range (statistics), NIST, Artificial intelligence, Electrical and Electronic Engineering
- Abstract
In real-world environments, noisy utterances with variable noise levels are recorded and then converted to i-vectors for cosine-distance or PLDA scoring. This paper investigates the effect of noise-level variability on i-vectors. It demonstrates that noise-level variability shifts the i-vectors, so that noise-contaminated i-vectors form clusters in the i-vector space, and that the optimal subspaces for discriminating speakers are noise-level dependent. Based on these observations, this paper proposes using the signal-to-noise ratio (SNR) of utterances as guidance for training a mixture of PLDA models. To maximize the coordination among the PLDA models, the mixture components are trained simultaneously via an EM algorithm using utterances contaminated with noise at various levels. For scoring, given a test i-vector, the marginal likelihoods from the individual PLDA models are linearly combined by the posterior probabilities of the test utterance's SNR, and the verification score is the ratio of the marginal likelihoods. Results based on the NIST 2012 SRE suggest that the SNR-dependent mixture of PLDA is not only suitable for situations where the test utterances exhibit a wide range of SNRs, but also beneficial when the test utterances have an unknown SNR distribution. Supplementary materials containing full derivations of the EM algorithms and scoring functions can be found at http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf .
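In equation form, the scoring rule described above might read as follows, with assumed notation: ℓ_t is the test utterance's SNR, and H_s and H_d denote the same-speaker and different-speaker hypotheses:
```latex
S(\mathbf{x}_s, \mathbf{x}_t)
  = \log \frac{\sum_{k} P(k \mid \ell_t) \, p(\mathbf{x}_s, \mathbf{x}_t \mid k, \mathcal{H}_s)}
              {\sum_{k} P(k \mid \ell_t) \, p(\mathbf{x}_s, \mathbf{x}_t \mid k, \mathcal{H}_d)}
```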
- Published
- 2016
- Full Text
- View/download PDF
7. Bayesian Factorization and Learning for Monaural Source Separation
- Author
- Jen-Tzung Chien and Po-Kai Yang
- Subjects
- Acoustics and Ultrasonics, Bayesian probability, Pattern recognition, Bayesian inference, Marginal likelihood, Non-negative matrix factorization, Computational Mathematics, Robustness (computer science), Prior probability, Computer Science (miscellaneous), Source separation, Artificial intelligence, Electrical and Electronic Engineering, Cluster analysis, Mathematics
- Abstract
This paper presents a new Bayesian nonnegative matrix factorization (NMF) for monaural source separation. In this approach, the reconstruction error of the NMF is represented by a Poisson distribution, and the NMF parameters, consisting of the basis and weight matrices, are characterized by exponential priors. A variational Bayesian inference procedure is developed to learn the variational and model parameters. The randomness of the separation process is faithfully represented, so that robustness to model variations in heterogeneous environments can be achieved. Importantly, the exponential prior parameters are used to impose sparseness on the basis representation. The variational lower bound of the log marginal likelihood is adopted as the objective to control model complexity, and the dependencies of this objective on the model parameters are fully characterized in the derived closed-form solution. A clustering algorithm is performed to find the groups of bases for unsupervised source separation. Experiments on speech/music separation and singing-voice separation show that the proposed Bayesian NMF (BNMF) with adaptive basis representation outperforms the NMF with a fixed number of bases, as well as other BNMFs, in terms of signal-to-distortion ratio and the global normalized source-to-distortion ratio.
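As a simplified point of comparison (not the paper's variational Bayesian algorithm), a MAP-style NMF with Poisson/KL reconstruction and an exponential sparsity prior on the basis can be written with multiplicative updates, where `lam` plays the role of the exponential prior's rate:
```python
import numpy as np

def sparse_nmf(V, K=10, lam=0.1, iters=200, seed=0):
    # V: nonnegative magnitude spectrogram of shape (F, T).
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + 1e-3   # basis matrix
    H = rng.random((K, T)) + 1e-3   # weight (activation) matrix
    for _ in range(iters):
        R = W @ H + 1e-9
        # The exponential prior on W adds lam to the denominator,
        # shrinking weak basis entries toward zero (sparser bases).
        W *= ((V / R) @ H.T) / (H.sum(axis=1) + lam)
        R = W @ H + 1e-9
        H *= (W.T @ (V / R)) / (W.sum(axis=0)[:, None] + 1e-9)
    return W, H
```
The paper instead infers posterior distributions over W and H and optimizes the variational lower bound, which is what allows the number of effective bases to adapt automatically.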
- Published
- 2016
- Full Text
- View/download PDF
8. Hierarchical Pitman–Yor–Dirichlet Language Model
- Author
- Jen-Tzung Chien
- Subjects
- Hierarchical Dirichlet process, Acoustics and Ultrasonics, Computer science, Probabilistic logic, Machine learning, Latent Dirichlet allocation, Dirichlet distribution, Computational Mathematics, Computer Science (miscellaneous), Bayesian hierarchical modeling, Language model, Chinese restaurant process, Artificial intelligence, Electrical and Electronic Engineering, Gibbs sampling
- Abstract
Probabilistic models are often viewed as insufficiently expressive because of the strong limitations and assumptions placed on their probability distributions and their fixed model complexity. Bayesian nonparametric learning pursues an expressive probabilistic representation based on nonparametric prior and posterior distributions, with a less assumption-laden approach to inference. This paper presents the hierarchical Pitman-Yor-Dirichlet (HPYD) process as a nonparametric prior for inferring the predictive probabilities of smoothed n-grams with integrated topic information. A hierarchical Chinese restaurant process metaphor is proposed to infer the HPYD language model (HPYD-LM) via Gibbs sampling. This process is equivalent to implementing hierarchical Dirichlet process-latent Dirichlet allocation (HDP-LDA) with a twisted hierarchical Pitman-Yor LM (HPY-LM) as base measures. Accordingly, the estimated HPYD-LM produces power-law distributions and extracts semantic topics that reflect the properties of natural language. The superiority of the HPYD-LM over the HPY-LM and other language models is demonstrated by experiments on model perplexity and speech recognition.
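For reference, the standard hierarchical Pitman-Yor predictive probability that the HPYD-LM builds on has the following form, where c are customer counts, t are table counts, d is the discount, θ the strength parameter, and π(u) the shortened context; the HPYD process twists this chain of base measures with topic information:
```latex
p(w \mid \mathbf{u})
  = \frac{c_{\mathbf{u}w} - d \, t_{\mathbf{u}w}}{\theta + c_{\mathbf{u}\cdot}}
  + \frac{\theta + d \, t_{\mathbf{u}\cdot}}{\theta + c_{\mathbf{u}\cdot}} \,
    p\bigl(w \mid \pi(\mathbf{u})\bigr)
```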
- Published
- 2015
- Full Text
- View/download PDF
9. Laplace Group Sensing for Acoustic Models
- Author
- Jen-Tzung Chien
- Subjects
- Acoustics and Ultrasonics, Markov chain, Basis (linear algebra), Feature vector, Pattern recognition, Laplace distribution, Marginal likelihood, Computational Mathematics, Lasso (statistics), Laplace's method, Computer Science (miscellaneous), Maximum a posteriori estimation, Artificial intelligence, Electrical and Electronic Engineering, Mathematics
- Abstract
This paper presents group sparse learning for acoustic models, where a sequence of acoustic features is driven by a Markov chain and each feature vector is represented by groups of basis vectors. The group of common bases represents the features across Markov states within a regression class, while the group of individual bases compensates for the intra-state residual information. A Laplace distribution is used as the sparse prior on the sensing weights for group basis representation. The Laplace parameter serves as a regularization parameter, or automatic relevance determination, that controls the selection of relevant bases for acoustic modeling. The groups of regularization parameters and basis vectors are estimated from training data by maximizing the marginal likelihood over the sensing weights, which is implemented by the Laplace approximation using the Hessian matrix and the maximum a posteriori parameters. Model uncertainty is compensated through a full Bayesian treatment, and the connection of Laplace group sensing to lasso regularization is illustrated. Experiments on noisy speech recognition show the robustness of group sparse acoustic models in the presence of different noise types and SNRs.
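The marginal-likelihood maximization mentioned above rests on the standard Laplace approximation to the evidence. With ŵ the maximum a posteriori sensing weights, H the Hessian of the negative log posterior at ŵ, and D the number of weights (notation assumed for illustration), it reads:
```latex
\log p(\mathcal{X} \mid \lambda)
  \approx \log p(\mathcal{X} \mid \hat{\mathbf{w}})
  + \log p(\hat{\mathbf{w}} \mid \lambda)
  + \frac{D}{2} \log 2\pi
  - \frac{1}{2} \log \bigl| \mathbf{H}(\hat{\mathbf{w}}) \bigr|
```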
- Published
- 2015
- Full Text
- View/download PDF