Author: "Lu, Heng" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Lu, Heng"' showing total 37 results

1. Deep Learning Meets OBIA: Tasks, Challenges, Strategies, and Perspectives

Author: Ma, Lei, Yan, Ziyun, Li, Mengmeng, Liu, Tao, Tan, Liqin, Wang, Xuan, He, Weiqiang, Wang, Ruikun, He, Guangjun, Lu, Heng, and Blaschke, Thomas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Deep learning has gained significant attention in remote sensing, especially in pixel- or patch-level applications. Despite initial attempts to integrate deep learning into object-based image analysis (OBIA), its full potential remains largely unexplored. In this article, as OBIA usage becomes more widespread, we conducted a comprehensive review and expansion of its task subdomains, with or without the integration of deep learning. Furthermore, we have identified and summarized five prevailing strategies to address the challenge of deep learning's limitations in directly processing unstructured object data within OBIA, and this review also recommends some important future research directions. Our goal with these endeavors is to inspire more exploration in this fascinating yet overlooked area and facilitate the integration of deep learning into OBIA processing workflows.
Published: 2024

2. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Author: Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, Zheng, Siqi, Gu, Yue, Ma, Ziyang, Gao, Zhifu, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have seen large language model (LLM) based text-to-speech (TTS) emerge into the mainstream due to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner and therefore lack explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to incorporate supervised speech tokens into TTS models., Comment: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051
Published: 2024

3. The Impact of Quantization and Pruning on Deep Reinforcement Learning Models

Author: Lu, Heng, Alemi, Mehdi, and Rawassizadeh, Reza
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Deep reinforcement learning (DRL) has achieved remarkable success across various domains, such as video games, robotics, and, recently, large language models. However, the computational costs and memory requirements of DRL models often limit their deployment in resource-constrained environments. The challenge underscores the urgent need to explore neural network compression methods to make DRL models more practical and broadly applicable. Our study investigates the impact of two prominent compression methods, quantization and pruning, on DRL models. We examine how these techniques influence four performance factors: average return, memory, inference time, and battery utilization across various DRL algorithms and environments. Although these compression techniques reduce model size, we find that they generally do not improve the energy efficiency of DRL models. We provide insights into the trade-offs between model compression and DRL performance, offering guidelines for deploying efficient DRL models in resource-constrained settings.
Published: 2024
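
Note on entry 3: the abstract names quantization and pruning without saying which variants were used. As a rough illustration only, the sketch below applies two common variants, magnitude pruning and symmetric per-tensor int8 quantization, to a toy weight list. The function names, the 50% sparsity level, and the sample weights are all assumptions made for this example and are not taken from the report.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical toy weights, chosen only to make the effect visible.
weights = [0.8, -0.05, 0.3, -1.27, 0.01, 0.6]
pruned = prune_by_magnitude(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
print(pruned)
print(q, scale)
```

Both steps shrink what has to be stored (zeros compress well, and each integer needs one byte instead of four), which is consistent with the abstract's observation that a smaller model does not by itself imply lower energy use at inference time.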

4. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Author: An, Keyu, Chen, Qian, Deng, Chong, Du, Zhihao, Gao, Changfeng, Gao, Zhifu, Gu, Yue, He, Ting, Hu, Hangrui, Hu, Kai, Ji, Shengpeng, Li, Yabin, Li, Zerui, Lu, Heng, Luo, Haoneng, Lv, Xiang, Ma, Bin, Ma, Ziyang, Ni, Chongjia, Song, Changhe, Shi, Jiaqi, Shi, Xian, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xiao, Zhangyu, Yan, Zhijie, Yang, Yexin, Zhang, Bin, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, and Zheng, Siqi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM., Comment: Work in progress. Authors are listed in alphabetical order by family name
Published: 2024

5. GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

Author: Pan, Yu, Yang, Yuguang, Lu, Heng, Ma, Lei, and Zhao, Jianjun
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to mitigate this gap. Specifically, GMP-TL initially uses the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage frame-level GMPs and utterance-level emotion labels, a two-stage model fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that our GMP-TL attains a WAR of 80.0% and an UAR of 82.0%, achieving superior performance compared to state-of-the-art unimodal SER methods while also yielding comparable results to multimodal SER approaches., Comment: Accepted to SLT2024
Published: 2024

6. Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

Author: Zhu, Xinfa, Lv, Yuanjun, Lei, Yi, Li, Tao, He, Wendi, Zhou, Hongbin, Lu, Heng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok ., Comment: 15 pages, 2 figures
Published: 2023

7. SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation

Author: Lv, Yuanjun, Yao, Jixun, Chen, Peikun, Zhou, Hongbin, Lu, Heng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker anonymization aims to conceal a speaker's identity without degrading speech quality and intelligibility. Most speaker anonymization systems disentangle the speaker representation from the original speech and achieve anonymization by averaging or modifying the speaker representation. However, the anonymized speech is subject to reduction in pseudo speaker distinctiveness, speech quality and intelligibility for out-of-distribution speakers. To solve this issue, we propose SALT, a Speaker Anonymization system based on Latent space Transformation. Specifically, we extract latent features by a self-supervised feature extractor and randomly sample multiple speakers and their weights, and then interpolate the latent vectors to achieve speaker anonymization. Meanwhile, we explore the extrapolation method to further extend the diversity of pseudo speakers. Experiments on the Voice Privacy Challenge dataset show our system achieves a state-of-the-art distinctiveness metric while preserving speech quality and intelligibility. Our code and demo are available at https://github.com/BakerBunker/SALT ., Comment: 8 pages, 3 figures; Accepted by ASRU2023
Published: 2023

8. PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Author: Lyu, Xiang, Cao, Yuhang, Wang, Qing, Yin, Jingjing, Yang, Yuguang, Zou, Pengpeng, Hu, Yanni, and Lu, Heng
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose the PP-MeT system, a real-world personalized prompt based meeting transcription system, which consists of a clustering system, target-speaker voice activity detection (TS-VAD), and TS-ASR. Specifically, we utilize target-speaker embedding as a prompt in the TS-VAD and TS-ASR modules of our proposed system. In contrast with previous systems, we fully leverage pre-trained models for system initialization, thereby giving our approach greater generalizability and precision. Experiments on the M2MeT2.0 Challenge dataset show that our system achieves a cp-CER of 11.27% on the test set, ranking first in both fixed and open training conditions.
Published: 2023

9. PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

Author: Yao, Jixun, Yang, Yuguang, Lei, Yi, Ning, Ziqian, Hu, Yanni, Pan, Yu, Yin, Jingjing, Zhou, Hongbin, Lu, Heng, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, the current style voice conversion approach relies on pre-defined labels or reference speech to control the conversion process, which leads to limitations in style diversity or falls short in terms of the intuitiveness and interpretability of style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and then the latent diffusion model is trained independently to sample the style vector from noise, with this process being conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the K-Means center embedding to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate the same discrete token and employ a differentiable duration predictor to re-predict the duration of each token, which can adapt the duration of the same linguistic content to different styles. The subjective and objective evaluation results demonstrate the effectiveness of our proposed system., Comment: Accepted by ICASSP 2024
Published: 2023

10. DiaCorrect: Error Correction Back-end For Speaker Diarization

Author: Han, Jiangyu, Landini, Federico, Rohdin, Johan, Diez, Mireia, Burget, Lukas, Cao, Yuhang, Lu, Heng, and Cernocky, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize the diarization errors. Experiments on 2-speaker telephony data show that the proposed DiaCorrect can effectively improve the initial model's results. Our source code is publicly available at https://github.com/BUTSpeechFIT/diacorrect., Comment: Submitted to ICASSP 2024
Published: 2023

11. MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition

Author: Pan, Yu, Yang, Yuguang, Huang, Yuheng, Yao, Jixun, Yin, Jingjing, Hu, Yanni, Lu, Heng, Ma, Lei, and Zhao, Jianjun
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in the wild. While current studies primarily focus on recognition and generalization abilities, our research pioneers an investigation into the reliability of SER methods in the presence of semantic data shifts and explores how to exert fine-grained control over various attributes inherent in speech signals to enhance speech emotion modeling. In this paper, we first introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER. Specifically, concentrating exclusively on the speech emotion attribute, a novel CNN-based SER model is presented to extract discriminative emotional representations, guided by additive margin softmax loss. Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes, termed Multiple Speech Attribute Control (MSAC), which empowers the proposed SER model to simultaneously capture fine-grained emotion-related features while mitigating the negative impact of emotion-agnostic representations. Furthermore, we make a first attempt to examine the reliability of the MSAC-SERNet framework using out-of-distribution detection methods. Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet not only consistently outperforms the baseline in all aspects, but also achieves superior performance compared to state-of-the-art SER approaches., Comment: 12 pages
Published: 2023

12. METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer

Author: Zhu, Xinfa, Lei, Yi, Li, Tao, Zhang, Yongmao, Zhou, Hongbin, Lu, Heng, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored synthesizing emotional aspects of speech due to the challenges of cross-speaker cross-lingual emotion transfer - the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal will make a system produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes the Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the good design of METTS., Comment: 10 pages, 3 figures
Published: 2023

13. GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

Author: Pan, Yu, Hu, Yanni, Yang, Yuguang, Fei, Wen, Yao, Jixun, Lu, Heng, Ma, Lei, and Zhao, Jianjun
Subjects: Computer Science - Computation and Language, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, whereas there is limited research on its merits in speech emotion recognition (SER). In this paper, we propose GEmo-CLAP, a kind of gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER, using pre-trained text and audio encoders. Second, given the significance of gender information in SER, two novel multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and soft label based GEmo-CLAP (SL-GEmo-CLAP) models are further proposed to incorporate gender information of speech signals, forming more reasonable objectives. Experiments on IEMOCAP indicate that our proposed two GEmo-CLAPs consistently outperform Emo-CLAP with different pre-trained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP obtains the best WAR of 83.16%, which performs better than state-of-the-art SER methods., Comment: 5 pages
Published: 2023

14. HYBRIDFORMER: improving SqueezeFormer with hybrid attention and NSR mechanism

Author: Yang, Yuguang, Pan, Yu, Yin, Jingjing, Han, Jiangyu, Ma, Lei, and Lu, Heng
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by the large convolution kernel size, the local modeling ability of SqueezeFormer is insufficient. In this paper, we propose a novel method HybridFormer to improve SqueezeFormer in a fast and efficient way. Specifically, we first incorporate linear attention (LA) and propose a hybrid LASA paradigm to increase the model's inference speed. Second, a hybrid neural architecture search (NAS) guided structural re-parameterization (SRep) mechanism, termed NSR, is proposed to enhance the ability of the model to extract local interactions. Extensive experiments conducted on the LibriSpeech dataset demonstrate that our proposed HybridFormer can achieve a 9.1% relative word error rate (WER) reduction over SqueezeFormer on the test-other dataset. Furthermore, when input speech is 30s, the HybridFormer can improve the model's inference speed up to 18%. Our source code is available online., Comment: Accepted by ICASSP2023
Published: 2023

15. LMEC: Learnable Multiplicative Absolute Position Embedding Based Conformer for Speech Recognition

Author: Yang, Yuguang, Pan, Yu, Yin, Jingjing, and Lu, Heng
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper proposes a Learnable Multiplicative absolute position Embedding based Conformer (LMEC). It contains a kernelized linear attention (LA) module called LMLA to solve the time-consuming problem for long sequence speech recognition as well as an alternative to the FFN structure. First, the ELU function is adopted as the kernel function of our proposed LA module. Second, we propose a novel Learnable Multiplicative Absolute Position Embedding (LM-APE) based re-weighting mechanism that can reduce the well-known quadratic temporal-space complexity of softmax self-attention. Third, we use Gated Linear Units (GLU) to substitute the Feed Forward Network (FFN) for better performance. Extensive experiments have been conducted on the public LibriSpeech datasets. Compared to the Conformer model with cosFormer style linear attention, our proposed method can achieve up to 0.63% word-error-rate improvement on test-other and improve the inference speed by up to 13% (left product) and 33% (right product) on the LA module., Comment: NCMMSC2022
Published: 2022
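
Note on entry 15: the abstract reports separate speed-ups for the "left product" and "right product" forms of kernelized linear attention. The sketch below shows why the two forms give the same output while differing in cost. It assumes the feature map phi(x) = ELU(x) + 1, a common choice in the linear-attention literature; the abstract only says that ELU is used as the kernel function, and the LM-APE re-weighting is not modeled here. All function names and the toy inputs are hypothetical.

```python
import math


def phi(x):
    # Assumed feature map ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise.
    return x + 1.0 if x > 0 else math.exp(x)


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def featurize(m):
    return [[phi(x) for x in row] for row in m]


def attention_left(Q, K, V):
    """(phi(Q) phi(K)^T) V: forms the n x n score matrix first."""
    fq, fk = featurize(Q), featurize(K)
    scores = matmul(fq, transpose(fk))           # n x n
    out = matmul(scores, V)                      # n x d_v
    return [[v / sum(s) for v in o] for o, s in zip(out, scores)]


def attention_right(Q, K, V):
    """phi(Q) (phi(K)^T V): never forms the n x n matrix."""
    fq, fk = featurize(Q), featurize(K)
    kv = matmul(transpose(fk), V)                # d x d_v
    k_sum = [sum(col) for col in zip(*fk)]       # length d
    out = matmul(fq, kv)                         # n x d_v
    norm = [sum(q * k for q, k in zip(row, k_sum)) for row in fq]
    return [[v / z for v in o] for o, z in zip(out, norm)]


# Hypothetical toy inputs: sequence length n = 3, feature size d = 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7]]
K = [[0.1, 0.4], [-0.8, 1.1], [0.9, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
left = attention_left(Q, K, V)
right = attention_right(Q, K, V)
print(left)
print(right)
```

Because matrix multiplication is associative, both orderings agree up to floating-point error. The left product scales with the square of the sequence length, while the right product scales linearly with it, which is the property that makes linear attention attractive for long utterances.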

16. DiaCorrect: End-to-end error correction for speaker diarization

Author: Han, Jiangyu, Cao, Yuhang, Lu, Heng, and Long, Yanhua
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods might bring additional benefits, most of them are quite complex. Motivated by spelling correction in automatic speech recognition (ASR), in this paper, we propose an end-to-end error correction framework, termed DiaCorrect, to refine the initial diarization results in a simple but efficient way. By exploiting the acoustic interactions between the input mixture and its corresponding speaker activity, DiaCorrect could automatically adapt the initial speaker activity to minimize the diarization errors. Without bells and whistles, experiments on LibriSpeech-based 2-speaker meeting-like data show that the self-attentive end-to-end neural diarization (SA-EEND) baseline with DiaCorrect could reduce its diarization error rate (DER) by over 62.4% from 12.31% to 4.63%. Our source code is available online at https://github.com/jyhan03/diacorrect., Comment: This paper has been superseded by arXiv:2309.08377 (merged from arXiv:2210.17189)
Published: 2022

17. Parallel measurements of vibrational modes in a few-layer graphene nanomechanical resonator using software-defined radio dongles

Author: Lu, Heng, Yang, Chen, Tian, Ye, Wang, Jue, Zhang, Ce, Zhang, Yubin, Chen, FengNan, Yan, Ying, and Moser, Joel
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: Software-defined radio dongles are small and inexpensive receivers well known to amateur radio enthusiasts. When connected to an antenna, they enable monitoring of a wide range of the radio spectrum by conditioning the input signal and transferring a downconverted version of it to a personal computer for software processing. Here, we employ a composite of two such dongles, interfaced with codes written in MATLAB and GNU Radio, as a measuring instrument to study the flexural vibrations of a few-layer graphene nanomechanical resonator. Instead of an antenna, we connect the dongles to the split output of a photodetector used to detect vibrations optically. We first perform a quantitative analysis of the dynamics of the first vibrational mode. We then measure the response of the first two vibrational modes in parallel. To illustrate our technique, we detect changes in the vibrational amplitude of both modes induced by periodic strain modulation with a delay of $\approx1$ ms between measurements. Last, we show that our software-based instrument can be employed to demodulate human voice encoded in the vibrations of our resonator. For parallel measurements of several frequency channels, and provided that the input signal is not too weak, our composite system may offer an alternative to the use of multiple lock-in amplifiers or multiple spectrum analyzers, with the distinct advantage of being cost-effective per frequency channel., Comment: 16 pages, 11 figures
Published: 2022

18. Improving Cross-lingual Speech Synthesis with Triplet Training Scheme

Author: Ye, Jianhao, Zhou, Hongbin, Su, Zhiba, He, Wendi, Ren, Kaimeng, Li, Lin, and Lu, Heng
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advances in cross-lingual text-to-speech (TTS) have made it possible to synthesize speech in a language foreign to a monolingual speaker. However, there is still a large gap between the pronunciation of generated cross-lingual speech and that of native speakers in terms of naturalness and intelligibility. In this paper, a triplet training scheme is proposed to enhance the cross-lingual pronunciation by allowing previously unseen content and speaker combinations to be seen during training. The proposed method introduces an extra fine-tuning stage with triplet loss during training, which efficiently draws the pronunciation of the synthesized foreign speech closer to that of the native anchor speaker, while preserving the non-native speaker's timbre. Experiments are conducted based on a state-of-the-art baseline cross-lingual TTS system and its enhanced variants. All the objective and subjective evaluations show the proposed method brings significant improvement in both intelligibility and naturalness of the synthesized cross-lingual speech.
Published: 2022
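
Note on entry 18: the abstract mentions a fine-tuning stage with triplet loss but does not give the distance function, the margin, or how anchor, positive, and negative samples are formed. For orientation, the sketch below shows the generic triplet margin loss with Euclidean distance. The margin of 1.0 and the two-dimensional toy vectors are assumptions for illustration, not values from the report.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: the first triplet violates the margin,
# the second one already satisfies it.
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
print(violating, satisfied)
```

Minimizing this quantity pulls the positive toward the anchor and pushes the negative away, which matches the abstract's description of drawing synthesized pronunciation closer to the native anchor speaker.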

19. The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

Author: He, Maokui, Lv, Xiang, Zhou, Weilin, Yin, JingJing, Zhang, Xiaoqi, Wang, Yuxuan, Niu, Shutong, Cao, Yuhang, Lu, Heng, Du, Jun, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge. These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavily reverberant and noisy conditions. First, for data preparation and augmentation in training TS-VAD models, speech data containing both real meetings and simulated indoor conversations are used. Second, in refining results obtained after TS-VAD based decoding, we perform a series of post-processing steps to improve the VAD results needed to reduce diarization error rates (DERs). Tested on the ALIMEETING corpus, the newly released Mandarin meeting dataset used in M2MeT, we demonstrate that our proposed system can decrease the DER by up to 66.55%/60.59% in relative terms when compared with classical clustering based diarization on the Eval/Test set.
Published: 2022

20. Imaging vibrations of locally gated, electromechanical few layer graphene resonators with a moving vacuum enclosure

Author: Lu, Heng, Yang, Chen, Tian, Ye, Lu, Jun, Xu, Fanqi, Chen, FengNan, Ying, Yan, Schädler, Kevin G., Wang, Chinhua, Koppens, Frank H. L., Reserbat-Plantey, Antoine, and Moser, Joel
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: Imaging the vibrations of nanomechanical resonators means measuring their flexural mode shapes from the dependence of their frequency response on in-plane position. Applied to two-dimensional resonators, this technique provides a wealth of information on the mechanical properties of atomically-thin membranes. We present a simple and robust system to image the vibrations of few layer graphene (FLG) resonators at room temperature and in vacuum with an in-plane displacement precision of $\approx0.20$ $\mu$m. It consists of a sturdy vacuum enclosure mounted on a three-axis micropositioning stage and designed for free space optical measurements of vibrations. The system is equipped with ultra-flexible radio frequency waveguides to electrically actuate resonators. With it we characterize the lowest frequency mode of a FLG resonator by measuring its frequency response as a function of position on the membrane. The resonator is suspended over a nanofabricated local gate electrode acting both as a mirror and as a capacitor plate to actuate vibrations at radio frequencies. From these measurements, we estimate the ratio of thermal expansion coefficient to thermal conductivity of the membrane, and we measure the effective mass of the lowest frequency mode. We complement our study with a globally gated resonator and image its first three vibration modes. There, we find that folds in the membrane locally suppress vibrations.
Published: 2021

21. Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training

Author: Guo, Haohan, Lu, Heng, Hu, Na, Zhang, Chunlei, Yang, Shan, Xie, Lei, Su, Dan, and Yu, Dong
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It can directly generate an arbitrary singing waveform given a phonetic posteriorgram (PPG) representing content, F0 representing pitch, and a speaker embedding representing timbre. The proposed system is composed of three modules: the generator $G$, the audio generation discriminator $D_{A}$, and the feature disentanglement discriminator $D_F$. The generator $G$ encodes the features in parallel and inversely transforms them into the target waveform. To make timbre conversion more stable and controllable, the speaker embedding is further decomposed into a weighted sum of a group of trainable vectors representing different timbre clusters. Further, to realize more robust and accurate singing conversion, the disentanglement discriminator $D_F$ is proposed to remove pitch- and timbre-related information that remains in the encoded PPG. Finally, two-stage training is conducted to keep the adversarial training process stable and effective. Subjective evaluation results demonstrate the effectiveness of the proposed methods. The proposed system outperforms the conventional cascade approach and the WaveNet-based end-to-end approach in terms of both singing quality and singer similarity. Further objective analysis reveals that the model trained with the proposed two-stage training strategy can produce a smoother and sharper formant, which leads to higher audio quality.
Published: 2020

22. TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis

Author: Tian, Qiao, Chen, Yi, Zhang, Zewang, Lu, Heng, Chen, Linghui, Xie, Lei, and Liu, Shan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, GAN-based speech synthesis methods, such as MelGAN, have become very popular. Compared to conventional autoregressive methods, generators based on parallel structures make the waveform generation process fast and stable. However, the quality of speech generated by autoregressive neural vocoders, such as WaveRNN, is still higher than that of GAN-based ones. To address this issue, we propose a novel vocoder model, TFGAN, which is adversarially learned in both the time and frequency domains. On one hand, we propose to discriminate the ground-truth waveform from the synthetic one in the frequency domain, rather than only in the time domain, to offer more consistency guarantees. On the other hand, in contrast to the conventional frequency-domain STFT loss or the discriminator's feature map loss for learning the waveform, we propose a set of time-domain losses that encourage the generator to capture the waveform directly. TFGAN has nearly the same synthesis speed as MelGAN, but its fidelity is significantly improved by our novel learning method. In our experiments, TFGAN achieves a mean opinion score (MOS) comparable to that of an autoregressive vocoder in a speech synthesis context.
Published: 2020

23. FeatherTTS: Robust and Efficient attention based Neural TTS

Author: Tian, Qiao, Zhang, Zewang, Liu, Chao, Lu, Heng, Chen, Linghui, Wei, Bin, He, Pujiang, and Liu, Shan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Attention-based neural TTS is an elegant speech synthesis pipeline and has shown a powerful ability to generate natural speech. However, it is still not robust enough to meet the stability requirements of industrial products. Besides, it suffers from slow inference speed owing to the autoregressive generation process. In this work, we propose FeatherTTS, a robust and efficient attention-based neural TTS system. Firstly, we propose a novel Gaussian attention which utilizes the interpretability of Gaussian attention and the strict monotonic property in TTS. With this method, we replace the commonly used stop token prediction architecture with attentive stop prediction. Secondly, we apply block sparsity to the autoregressive decoder to speed up speech synthesis. The experimental results show that the proposed FeatherTTS not only nearly eliminates word skipping and repeating in particularly hard texts while keeping the naturalness of the generated speech, but also speeds up acoustic feature generation by 3.5 times over Tacotron. Overall, the proposed FeatherTTS can be $35$x faster than real-time on a single CPU.
Published: 2020

24. Dyson's Equations for Quantum Gravity in the Hartree-Fock Approximation

Author: Hamber, Herbert W. and Yu, Lu Heng Sunny
Subjects: High Energy Physics - Theory, General Relativity and Quantum Cosmology
Abstract: Unlike scalar and gauge field theories in four dimensions, gravity is not perturbatively renormalizable and as a result perturbation theory is badly divergent. Often the method of choice for investigating nonperturbative effects has been the lattice formulation, and in the case of gravity the Regge-Wheeler lattice path integral lends itself well for that purpose. Nevertheless, lattice methods ultimately rely on extensive numerical calculations, leaving a desire for alternate calculations that can be done analytically. In this work we outline the Hartree-Fock approximation to quantum gravity, along lines which are analogous to what is done for scalar fields and gauge theories. The starting point is Dyson's equations, a closed set of integral equations which relate various physical amplitudes involving graviton propagators, vertex functions and proper self-energies. Such equations are in general difficult to solve, and as a result not very useful in practice, but nevertheless provide a basis for subsequent approximations. This is where the Hartree-Fock approximation comes in, whereby lowest order diagrams get partially dressed by the use of fully interacting Green's function and self-energies, which then lead to a set of self-consistent integral equations. Specifically, for quantum gravity one finds a nontrivial ultraviolet fixed point in Newton's constant G for spacetime dimensions greater than two, and nontrivial scaling dimensions between d=2 and d=4, above which one obtains Gaussian exponents. In addition, the Hartree-Fock approximation gives an explicit analytic expression for the renormalization group running of Newton's constant, suggesting gravitational antiscreening with Newton's G slowly increasing on cosmological scales., Comment: 71 pages, 21 figures. More typos fixed, references added
Published: 2020

25. Peking Opera Synthesis via Duration Informed Attention Network

Author: Wu, Yusong, Li, Shengchen, Yu, Chengzhu, Lu, Heng, Weng, Chao, Zhang, Liqiang, and Yu, Dong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Peking Opera has been the most dominant form of Chinese performing art for around 200 years. A Peking Opera singer usually exhibits a very strong personal style by introducing improvisation and expressiveness on stage, which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in Peking Opera singing voice synthesis from a music score. In this work, we propose to deal with this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework. To tackle the rhythm mismatch, a Lagrange multiplier is used to find the optimal output phoneme duration sequence under the constraint of the given note duration from the music score. As for the pitch contour mismatch, instead of directly inferring from the music score, we adopt a pseudo music score generated from the real singing and feed it as input during training. The experiments demonstrate that with the proposed system we can synthesize Peking Opera singing voice with high-quality timbre, pitch and expressiveness., Comment: Accepted by INTERSPEECH 2020
Published: 2020

26. DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System

Author: Zhang, Liqiang, Yu, Chengzhu, Lu, Heng, Weng, Chao, Zhang, Chunlei, Wu, Yusong, Xie, Xiang, Li, Zijin, and Yu, Dong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker is much more difficult to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high-quality singing in the target speaker's voice using only his or her normal speech data. First, we integrate the training and conversion processes of speech and singing into one framework by unifying the features used in standard speech synthesis and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust, especially when the singing database is small. Moreover, in order to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identity information during conversion. Experiments indicate that the proposed singing conversion system can convert source singing to the target speaker's high-quality singing with only 20 seconds of the target speaker's enrollment speech data., Comment: Accepted by Interspeech 2020
Published: 2020

27. Gravitational Fluctuations as an Alternative to Inflation III. Numerical Results

Author: Hamber, Herbert W., Yu, Lu Heng Sunny, and Kankanamge, Hasitha E. Pituwala
Subjects: Astrophysics - Cosmology and Nongalactic Astrophysics, General Relativity and Quantum Cosmology
Abstract: Power spectra play an important role in the theory of inflation, and their ability to reproduce current observational data to high accuracy is often considered a triumph of inflation, largely because of a lack of credible alternatives. In previous work we introduced an alternative picture for the cosmological power spectra based on the nonperturbative features of the quantum version of Einstein's gravity, instead of currently popular inflation models based on scalar fields. The key ingredients in this new picture are the appearance of a nontrivial gravitational vacuum condensate (directly related to the observed cosmological constant), and a calculable renormalization group running of Newton's G on cosmological scales. Results obtained previously were largely based on a semi-analytical treatment, and often suffered from the limitations of various approximations and simplifying assumptions. In this work, we extend and refine our previous calculations by laying out an updated and extended analysis, which now utilizes a set of suitably modified state-of-the-art numerical programs (ISiTGR, MGCAMB and MGCLASS) developed for observational cosmology. As a result, we are able to remove some of the approximations employed in our previous studies, leading to a number of novel and detailed physical predictions. These should help to distinguish the vacuum condensate picture of quantum gravity from other models such as scalar field inflation. Here, besides the matter power spectrum P(k), we work out in detail predictions for what are referred to as the TT, TE, EE, BB angular spectra, as well as their closely related lensing spectra. However, the currently limited precision of observational data (especially on large angular scales) does not yet allow us to clearly prove or disprove either set of ideas., Comment: 44 pages, 10 figures
Published: 2020

28. AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN

Author: Zhang, Zewang, Tian, Qiao, Lu, Heng, Chen, Ling-Hui, and Liu, Shan
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper investigates how to leverage a DurIAN-based average model to enable a new speaker to have both accurate pronunciation and fluent cross-lingual speaking with very limited monolingual data. A weakness of recently proposed end-to-end text-to-speech (TTS) systems is that robust alignment is hard to achieve, which prevents them from scaling well with very limited data. To cope with this issue, we introduce AdaDurIAN by training an improved DurIAN-based average model and leveraging it for few-shot learning with a speaker-independent content encoder shared across different speakers. Several few-shot learning tasks in our experiments show that AdaDurIAN can outperform the baseline end-to-end system by a large margin. Subjective evaluations also show that AdaDurIAN yields a higher mean opinion score (MOS) for naturalness and more preferences for speaker similarity. In addition, we apply AdaDurIAN to emotion transfer tasks and demonstrate its promising performance., Comment: Submitted to InterSpeech 2020
Published: 2020

29. FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction

Author: Tian, Qiao, Zhang, Zewang, Lu, Heng, Chen, Ling-Hui, and Liu, Shan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose the FeatherWave, yet another variant of WaveRNN vocoder combining the multi-band signal processing and the linear predictive coding. The LPCNet, a recently proposed neural vocoder which utilized the linear predictive characteristic of speech signal in the WaveRNN architecture, can generate high quality speech with a speed faster than real-time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt the multi-band linear predictive coding for WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at one step. Therefore, it can significantly improve the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it can generate 24 kHz high-fidelity audio 9x faster than real-time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that the FeatherWave can generate speech with better quality than LPCNet., Comment: Accepted by INTERSPEECH 2020
Published: 2020

30. Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network

Author: Wu, Yusong, Li, Shengchen, Yu, Chengzhu, Lu, Heng, Weng, Chao, Zhang, Liqiang, and Yu, Dong
Subjects: Computer Science - Computation and Language
Abstract: This paper presents a method that generates expressive singing voice of Peking opera. The synthesis of expressive opera singing usually requires pitch contours to be extracted as the training data, which relies on extraction techniques and cannot be manually labeled. With the Duration Informed Attention Network (DurIAN), this paper makes use of musical notes instead of pitch contours for expressive opera singing synthesis. The proposed method allows human annotation to be combined with automatically extracted features as training data, which gives extra flexibility in data collection for Peking opera singing synthesis. Compared with the expressive singing voice of Peking opera synthesised by a pitch contour based system, the proposed musical note based system produces comparable singing voice in Peking opera with expressiveness in various aspects.
Published: 2019

31. Learning Singing From Speech

Author: Zhang, Liqiang, Yu, Chengzhu, Lu, Heng, Weng, Chao, Wu, Yusong, Xie, Xiang, Li, Zijin, and Yu, Dong
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose an algorithm that is capable of synthesizing a target speaker's high-quality singing voice given only their normal speech samples. The proposed algorithm first integrates speech and singing synthesis into a unified framework, and learns universal speaker embeddings that are shareable between speech and singing synthesis tasks. Specifically, the speaker embeddings learned from normal speech via the speech synthesis objective are shared with those learned from singing samples via the singing synthesis objective in the unified training framework. This makes the learned speaker embedding a transferable representation for both speaking and singing. We evaluate the proposed algorithm on a singing voice conversion task where the content of the original singing is covered with the timbre of another speaker's voice learned purely from their normal speech samples. Our experiments indicate that the proposed algorithm generates high-quality singing voices that sound highly similar to the target speaker's voice given only his or her normal speech samples. We believe that the proposed algorithm will open up new opportunities for singing synthesis and conversion for broader users and applications., Comment: Submitted to ICASSP-2020
Published: 2019

32. PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network

Author: Deng, Chengqi, Yu, Chengzhu, Lu, Heng, Weng, Chao, and Yu, Dong
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Singing voice conversion is to convert a singer's voice to another one's voice without changing singing content. Recent work shows that unsupervised singing voice conversion can be achieved with an autoencoder-based approach [1]. However, the converted singing voice can be easily out of key, showing that the existing approach cannot model the pitch information precisely. In this paper, we propose to advance the existing unsupervised singing voice conversion method proposed in [1] to achieve more accurate pitch translation and flexible pitch manipulation. Specifically, the proposed PitchNet added an adversarially trained pitch regression network to enforce the encoder network to learn pitch invariant phoneme representation, and a separate module to feed pitch extracted from the source audio to the decoder network. Our evaluation shows that the proposed method can greatly improve the quality of the converted singing voice (2.92 vs 3.75 in MOS). We also demonstrate that the pitch of converted singing can be easily controlled during generation by changing the levels of the extracted pitch before passing it to the decoder network., Comment: Accepted by ICASSP 2020
Published: 2019

33. Gravitational Fluctuations as an Alternative to Inflation II. CMB Angular Power Spectrum

Author: Hamber, Herbert W. and Yu, Lu Heng Sunny
Subjects: General Relativity and Quantum Cosmology, Astrophysics - Cosmology and Nongalactic Astrophysics, High Energy Physics - Theory
Abstract: Power spectra always play an important role in the theory of inflation. In particular, the ability to reproduce the galaxy matter power spectrum and the CMB temperature angular power spectrum coefficients to high accuracy is often considered a triumph of inflation. In our previous work, we presented an alternative explanation for the matter power spectrum based on nonperturbative quantum field-theoretical methods applied to Einstein's gravity, instead of inflation models based on scalar fields. In this work, we review the basic concepts and provide further in-depth investigations. We first update the analysis with more recent data sets and error analysis, and then extend our predictions to the CMB angular spectrum coefficients, which we did not consider previously. Then we investigate further the potential freedoms and uncertainties associated with the fundamental parameters that are part of this picture, and show how recent cosmological data provides significant constraints on these quantities. Overall, we find good general consistency between theory and data, even potentially favoring the gravitationally-motivated picture at the largest scales. We summarize our results by outlining how this picture can be tested in the near future with increasingly accurate astrophysical measurements., Comment: 43 pages, 8 figures (typos fixed, references added)
Published: 2019

34. DurIAN: Duration Informed Attention Network For Multimodal Synthesis

Author: Yu, Chengzhu, Lu, Heng, Hu, Na, Yu, Meng, Weng, Chao, Xu, Kun, Liu, Peng, Tuo, Deyi, Kang, Shiyin, Lei, Guangzhi, Su, Dan, and Yu, Dong
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This is different from the end-to-end attention mechanism that is used in, and accounts for various unavoidable artifacts in, existing end-to-end speech synthesis systems such as Tacotron. Furthermore, DurIAN can be used to generate high-quality facial expression that can be synchronized with generated speech, with or without parallel speech and face data. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN effectively reduces the total computational complexity from 9.8 to 5.5 GFLOPS, and is able to generate audio 6 times faster than real time on a single CPU core. We show that DurIAN can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while at the same time avoiding the word skipping and repeating errors of those systems. Finally, a simple yet effective approach for fine-grained control of the expressiveness of speech and facial expression is introduced.
Published: 2019

35. Gravitational Fluctuations as an Alternative to Inflation

Author: Hamber, Herbert W. and Yu, Lu Heng Sunny
Subjects: General Relativity and Quantum Cosmology, Astrophysics - Cosmology and Nongalactic Astrophysics, High Energy Physics - Theory
Abstract: The ability to reproduce the observed matter power spectrum $P(k)$ to high accuracy is often considered as a triumph of inflation. In this work, we explore an alternative explanation for the power spectrum based on nonperturbative quantum field-theoretical methods applied to Einstein's gravity, instead of ones based on inflation models. In particular the power spectral index, which governs the slope on the $P(k)$ graph, can be related to critical scaling exponents derived from the Wilson renormalization group analysis. We find that the derived value fits favorably with the Sloan Digital Sky Survey telescope data. We then make use of the transfer functions, based only on the Boltzmann equations which describe states out of equilibrium, and Einstein's General Relativity, to extrapolate the power spectrum to the Cosmic Microwave Background (CMB) regime. We observe that the results fit rather well with current data. Our approach contrasts with the conventional explanation which uses inflation to generate the scale invariant Harrison-Zel'dovich spectrum on CMB scales, and uses the transfer function to extrapolate it to galaxy regime. The results we present here only assume quantum field theory and Einstein's Gravity, and hence provide a competing explanation of the power spectrum, without relying on the assumptions usually associated with inflationary models. At the end, we also outline several testable predictions in this picture that deviate from the conventional picture of inflation, and which hopefully will become verifiable in the near future with increasingly accurate measurements., Comment: 33 pages, 6 figures. One figure added following the July 2018 release of new Planck data. Typos fixed, more references added. Paper now conforms to the published version
Published: 2018

36. Linear networks based speaker adaptation for speech synthesis

Author: Huang, Zhiying, Lu, Heng, Lei, Ming, and Yan, Zhijie
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speaker adaptation methods aim to create a fair-quality synthesized speech voice font for target speakers when only limited resources are available. Recently, as deep neural network based statistical parametric speech synthesis (SPSS) methods have become dominant in SPSS TTS back-end modeling, speaker adaptation under the neural network based SPSS framework has also become an important task. In this paper, linear networks (LN) are inserted in multiple neural network layers and fine-tuned together with the output layer for best speaker adaptation performance. When adaptation data is extremely small, the low-rank plus diagonal (LRPD) decomposition for LN is employed to make the adapted voice more stable. Speaker adaptation experiments are conducted over a range of numbers of adaptation utterances. Moreover, speaker adaptation from 1) female to female, 2) male to female and 3) female to male is investigated. Objective measurement and subjective tests show that LN with LRPD decomposition performs most stably when adaptation data is extremely limited, and our best speaker adaptation (SA) model with only 200 adaptation utterances achieves quality comparable to a speaker dependent (SD) model trained with 1000 utterances, in both naturalness and similarity to the target speaker., Comment: 5 pages, 6 figures, accepted by ICASSP 2018
Published: 2018

37. Deep Feed-forward Sequential Memory Networks for Speech Synthesis

Author: Bi, Mengxiao, Lu, Heng, Zhang, Shiliang, Lei, Ming, and Yan, Zhijie
Subjects: Computer Science - Computation and Language
Abstract: The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody. However, the model complexity and inference cost of BLSTM prevent its usage in many runtime applications. Meanwhile, Deep Feed-forward Sequential Memory Networks (DFSMN) have consistently outperformed BLSTM in both word error rate (WER) and runtime computation cost in speech recognition tasks. Since speech synthesis, like speech recognition, also requires modeling long-term dependencies, in this paper we investigate the Deep-FSMN (DFSMN) in speech synthesis. Both objective and subjective experiments show that, compared with the BLSTM TTS method, the DFSMN system can generate synthesized speech with comparable speech quality while drastically reducing model complexity and speech generation time., Comment: 5 pages, ICASSP 2018
Published: 2018
