Author: "Zhang, Xulong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Xulong"' showing total 571 results

1. Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Author: Shi, Haoxiang, Zhang, Xulong, Cheng, Ning, Zhang, Yong, Yu, Jun, Xiao, Jing, and Wang, Jianzong
Subjects: Computer Science - Computation and Language
Abstract: The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, resulting in the model being unable to focus on modality-specific emotional information. At the same time, the shared information between modalities was not processed, resulting in an information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network mainly includes two stages: the multi-modal feature fusion stage based on connection vectors and the emotion classification stage based on fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, demonstrating excellent performance on the IEMOCAP and MELD datasets.
Comment: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)
Published: 2024

2. RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Author: Shi, Haoxiang, Wang, Jianzong, Zhang, Xulong, Cheng, Ning, Yu, Jun, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.
Comment: Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data
Published: 2024

3. RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Author: Wang, Jianzong, Shi, Haoxiang, Luo, Kaiyi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Information Retrieval
Abstract: Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be only partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors are embedded into the latent subspace, which are computed by efficient linear reconstruction. The anchors are sampled from paired data, which improves the efficiency of hash learning. RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With the well-devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed method is tested with partially paired data to establish its superiority over several existing methods.
Comment: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)
Published: 2024

4. MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Author: Li, Pengcheng, Wang, Jianzong, Zhang, Xulong, Zhang, Yong, Xiao, Jing, and Cheng, Ning
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to effectively disentangle via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module keep our model lightweight while ensuring performance in diverse voice conversion tasks. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.
Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

5. Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Author: Deng, Yimin, Wang, Jianzong, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion is the task of transforming the voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage, while the prosodic information of hidden units is underused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experimental results show that the intelligibility and naturalness of converted speech outperform previous work.
Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

6. QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering

Author: Ouyang, Sheng, Wang, Jianzong, Zhang, Yong, Li, Zhitao, Liang, Ziqi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computation and Language
Abstract: Extractive Question Answering (EQA) in Machine Reading Comprehension (MRC) often faces the challenge of dealing with semantically identical but format-variant inputs. Our work introduces a novel approach, called the "Query Latent Semantic Calibrator (QLSC)", designed as an auxiliary module for existing MRC models. We propose a unique scaling strategy to capture latent semantic center features of queries. These features are then seamlessly integrated into traditional query and passage embeddings using an attention mechanism. By deepening the comprehension of the semantic queries-passage relationship, our approach diminishes sensitivity to variations in text format and boosts the model's capability in pinpointing accurate answers. Experimental results on robust Question-Answer datasets confirm that our approach effectively handles format-variant but semantically identical queries, highlighting the effectiveness and adaptability of our proposed method.
Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

7. EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Author: Liang, Ziqi, Wang, Jianzong, Zhang, Xulong, Zhang, Yong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient information disentangling, so pitch and rhythm may still be mixed together. There is a risk of information overlap in the disentangling process, which results in less speech naturalness. To overcome such limits, we propose a two-stage model to disentangle speech representations in a self-supervised manner without a human-crafted bottleneck design, which uses the Mutual Information (MI) with the designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage issues. Experiments show that our model can achieve a better performance than the baseline, regarding disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.
Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

8. EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

Author: Wang, Jianzong, Liang, Ziqi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.
Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

9. CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

Author: Wang, Jianzong, Li, Pengcheng, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. Hence, we propose a fast and high-fidelity singing voice beautifying system called ConTuner, a diffusion model combined with the modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping relationship from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose the expressiveness enhancer in the latent space to convert amateur vocal tone to professional. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.
Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

10. Medical Speech Symptoms Classification via Disentangled Representation

Author: Wang, Jianzong, Li, Pengcheng, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Artificial Intelligence
Abstract: Intent is defined for understanding spoken language in existing works. Both textual features and acoustic features involved in medical speech contain intent, which is important for symptomatic diagnosis. In this paper, we propose a medical speech classification model named DRSC that automatically learns to disentangle intent and content representations from textual-acoustic data for classification. The intent representations of the text domain and the Mel-spectrogram domain are extracted via intent encoders, and then the reconstructed text feature and the Mel-spectrogram feature are obtained through two exchanges. After combining the intent from two domains into a joint representation, the integrated intent representation is fed into a decision layer for classification. Experimental results show that our model obtains an average accuracy rate of 95% in detecting 25 different medical symptoms.
Comment: Accepted by the 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2024)
Published: 2024

11. Rapid humification of cotton stalk catalyzed by coal fly ash and its excellent cadmium passivation performance

Author: Zhou, Hao, Dang, Yan, Chen, Xinyu, Ivanets, Andrei, Ratko, Alexander A., Kouznetsova, Tatyana, Liu, Yongqi, Yang, Bo, Zhang, Xulong, Sun, Yiwei, He, Xiaoyan, Ren, Yanjie, and Su, Xintai
Published: 2024

12. Elimination of the confrontation between theory and experiment in flexoelectric Bi2GeO5

Author: Cao, Yuying, Zhang, Xulong, Zhou, Long, Liu, Hongfei, Gao, Hua, Zheng, Fu, and Ma, Zhi
Subjects: Condensed Matter - Materials Science, Condensed Matter - Mesoscale and Nanoscale Physics, Condensed Matter - Other Condensed Matter
Abstract: In this paper, we have investigated the flexoelectric effect of Bi2GeO5 (BGO), successfully predicted the maximum flexoelectric coefficient of BGO, and tried to explore the difference between experimental and simulated flexoelectric coefficients.
Comment: 15 pages, 6 figures
Published: 2023

13. DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Author: Wang, Jianzong, Li, Pengcheng, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource conditions. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a sequential autoencoder. When given paired data, the module incorporates a trainable codebook that learns quantized representations under the supervision of the paired data. However, because paired data are limited in low-resource scenarios, they can hardly cover all phonemes. Unpaired data is therefore fed in to expand the dynamic codebook by adding quantized representation vectors that are sufficiently distant from the existing ones during training. Experiments show that with less than 120 minutes of paired data, the proposed method outperforms existing methods in both subjective and objective metrics.
Comment: Accepted by the 13th IEEE International Conference on Big Data and Cloud Computing (IEEE BDCloud 2023)
Published: 2023
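
Illustrative note on entry 13: the abstract describes growing a codebook with vectors that are sufficiently distant from the existing entries. The sketch below shows one way such a rule could look. It is not the authors' implementation: the function name `expand_codebook`, the Euclidean metric, and the `min_distance` threshold are all assumptions made for illustration.

```python
import math

def expand_codebook(codebook, candidates, min_distance):
    """Append each candidate whose nearest existing entry is at least
    min_distance away (Euclidean distance; an illustrative assumption)."""
    expanded = [list(v) for v in codebook]
    for vec in candidates:
        # Distance to the closest entry, including entries added earlier in this loop.
        nearest = min((math.dist(vec, entry) for entry in expanded), default=math.inf)
        if nearest >= min_distance:
            expanded.append(list(vec))
    return expanded

# Toy 2-D example: two learned entries plus four candidates from unpaired data.
codebook = [[0.0, 0.0], [1.0, 0.0]]
unpaired = [[0.1, 0.0], [0.0, 2.0], [0.1, 2.0], [3.0, 3.0]]
expanded = expand_codebook(codebook, unpaired, min_distance=0.5)
print(len(expanded))  # 4: only [0.0, 2.0] and [3.0, 3.0] are far enough to be added
```

Candidates are compared against entries added earlier in the same pass, so two near-duplicate candidates do not both enter the codebook.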

14. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Author: Wang, Jianzong, Deng, Yimin, Liang, Ziqi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference, to synthesize a photo-realistic talking-person video with head poses controlled by a short video clip and proper eye blinking embedding. It is noted that both head pose and eye blinking are important aspects for deepfake detection. The implicit control of poses by video has already been achieved by state-of-the-art work. According to recent research, eye blinking has a weak correlation with the input audio, which means that extracting eye blinks from audio and generating them is possible. Hence, we propose a GAN-based architecture to extract eye blink features from input audio and reference video respectively and employ contrastive training between them, then embed them into the concatenated features of identity and poses to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking faces with synchronous lip motions, natural head poses and blinking eyes.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
Published: 2023

15. CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Author: Deng, Yimin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently, contrastive learning has been applied to voice conversion successfully based on speaker labels. However, model performance degrades in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering the fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
Published: 2023

16. Rhinoplasty with Mortise–Tenon Cartilaginous Framework for Caudal Septal Cartilage Defects

Author: Zhang, Xulong, Song, Zhen, Xu, Yihao, Zheng, Ruobing, Tian, Le, Guo, Junsheng, Wang, Huan, You, Jianjun, and Fan, Fei
Published: 2024

17. Clinical Application of Botulinum Toxin A on Nasal Reconstruction with Expanded Forehead Flap for Asian Patients

Author: Lin, Guangxian, Zhang, Xulong, Song, Zhen, Xu, Yihao, Wang, Huan, Zheng, Ruobing, Fan, Fei, and You, Jianjun
Published: 2024

18. Stock Volatility Prediction Based on Transformer Model Using Mixed-Frequency Data

Author: Liu, Wenting, Gui, Zhaozhong, Jiang, Guilin, Tang, Lihua, Zhou, Lichun, Leng, Wan, Zhang, Xulong, and Liu, Yujiang
Subjects: Quantitative Finance - Statistical Finance
Abstract: With the increasing volume of high-frequency data in the information age, both challenges and opportunities arise in the prediction of stock volatility. On one hand, the outcome of prediction using traditional methods that combine stock technical and macroeconomic indicators still leaves room for improvement; on the other hand, macroeconomic indicators and people's search records on search engines, which reflect their topics of interest, will intuitively have an impact on stock volatility. To make it easier to assess the influence of these indicators, macroeconomic indicators and stock technical indicators are grouped into objective factors, while Baidu search indices reflecting people's topics of interest are defined as subjective factors. To align data of different frequencies, we introduce the GARCH-MIDAS model. After mixing all the above data, we feed them into a Transformer model as part of the training data. Our experiments show that this model outperforms the baselines in terms of mean square error. Using both types of data with the Transformer model significantly reduces the mean square error from 1.00 to 0.86.
Comment: Accepted by the 7th APWeb-WAIM International Joint Conference on Web and Big Data (APWeb 2023)
Published: 2023

19. Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model

Author: Zhou, Shanyi, Yan, Ning, Li, Zhijun, Geng, Mo, Zhang, Xulong, Si, Hongbiao, Tang, Lihua, Sun, Wenyuan, Zhang, Longda, and Cao, Yi
Subjects: Mathematics - Numerical Analysis, Economics - General Economics
Abstract: Based on principal-agent theory and optimal contract theory, companies use the method of increasing executives' shareholding to stimulate collaborative innovation. However, from the aspect of agency costs between management and shareholders (i.e. the first type) and between major shareholders and minority shareholders (i.e. the second type), the interests of management, shareholders and creditors will be unbalanced with the change of the marginal utility of executive equity incentives. In order to establish the correlation between the proportion of shares held by executives and investments in corporate innovation, we have chosen a range of publicly listed companies within China's A-share market as the focus of our study. Employing a multi-variable linear regression model, we aim to analyze this relationship thoroughly. The following models were developed: (1) the impact model of executive shareholding on corporate innovation investment; (2) the impact model of executive shareholding on two types of agency costs; (3) a model examining the mediating influence of the two types of agency costs. Following both correlation and regression analyses, the findings confirm a meaningful and positive correlation between executives' shareholding and the augmentation of corporate innovation investments. Additionally, the results indicate that executive shareholding contributes to the reduction of the first type of agency cost, thereby fostering corporate innovation investment. However, simultaneously, it leads to an escalation in the second type of agency cost, thus impeding corporate innovation investment.
Comment: Accepted by the 7th APWeb-WAIM International Joint Conference on Web and Big Data (APWeb 2023)
Published: 2023

20. A Hierarchy-based Analysis Approach for Blended Learning: A Case Study with Chinese Students

Author: Ye, Yu, Zhang, Gongjin, Si, Hongbiao, Xu, Liang, Hu, Shenghua, Li, Yong, Zhang, Xulong, Hu, Kaiyu, and Ye, Fangzhou
Subjects: Computer Science - Computers and Society, Computer Science - Human-Computer Interaction
Abstract: Blended learning is generally defined as the combination of traditional face-to-face learning and online learning. This learning mode has been widely used in advanced education across the globe due to the social-distancing restrictions of the COVID-19 pandemic as well as the development of technology. Online learning plays an important role in blended learning, and as it requires more student autonomy, the quality of blended learning in advanced education has been a persistent concern. Existing literature offers several elements and frameworks for evaluating the quality of blended learning. However, most of them either favour particular evaluation perspectives or simply offer general guidance for evaluation, reducing the completeness, objectivity and practicality of related works. To provide a more intuitive and comprehensive evaluation framework, this paper proposes a hierarchy-based analysis approach. Applying a gradient boosting model and a feature importance evaluation method, this approach mainly analyses student engagement and its three identified dimensions (behavioral engagement, emotional engagement, cognitive engagement) to address some persistent problems in blended learning evaluation. The results show that cognitive engagement and emotional engagement play a more important role in blended learning evaluation, implying that these two dimensions should be prioritized to improve learning as well as teaching quality.
Comment: Accepted by the 7th APWeb-WAIM International Joint Conference on Web and Big Data (APWeb 2023)
Published: 2023

21. An Empirical Study of Attention Networks for Semantic Segmentation

Author: Guo, Hao, Si, Hongbiao, Jiang, Guilin, Zhang, Wei, Liu, Zhiyan, Zhu, Xuanyi, Zhang, Xulong, and Liu, Yang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Semantic segmentation is a vital problem in computer vision. A common solution to semantic segmentation is the end-to-end convolutional neural network, which is much more accurate than traditional methods. Recently, attention-based decoders have achieved state-of-the-art (SOTA) performance on various datasets. However, these networks are always compared with the mIoU of previous SOTA networks to prove their superiority, without considering their computation complexity and their precision in various categories, which are essential for engineering applications. Besides, the methods used to analyze FLOPs and memory are not consistent between different networks, which makes the comparisons hard to use. Moreover, various methods utilize attention in semantic segmentation, but a summary of these methods is lacking. This paper first conducts experiments to analyze their computation complexity and compare their performance. Then it summarizes suitable scenarios for these networks and identifies key points to consider when constructing an attention network. Finally, it points out some future directions for attention networks.
Comment: Accepted by the 7th APWeb-WAIM International Joint Conference on Web and Big Data (APWeb 2023)
Published: 2023

22. Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Author: Luo, Kaiyi, Zhang, Xulong, Wang, Jianzong, Li, Huaxiong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, has posed a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting the intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades the performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, the latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments in comparison with some state-of-the-art methods on two audio-text datasets have validated the superiority of CLSR.
Comment: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
Published: 2023
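
Illustrative note on entry 22: the abstract refers to contrastive learning in a common subspace with a temperature parameter. For orientation, the sketch below computes a generic, fixed-temperature InfoNCE-style loss over toy audio and text embeddings. It is not the CLSR method: it has no adaptive temperature, no intra-modal term, and no reconstruction module, and the function names and the default `temperature=0.1` are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(audio, text, temperature=0.1):
    """Mean audio-to-text contrastive loss; audio[i] and text[i] form the matched pair."""
    losses = []
    for i, a in enumerate(audio):
        logits = [cosine(a, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(audio, [[1.0, 0.1], [0.1, 1.0]])  # matched pairs point the same way
swapped = info_nce(audio, [[0.1, 1.0], [1.0, 0.1]])  # matched pairs are mismatched
print(aligned < swapped)  # True
```

A lower temperature sharpens the softmax over candidates, which is why the abstract treats the temperature as a parameter worth adapting.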

23. FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Author: Wang, Jianzong, Zhang, Xulong, Sun, Aolan, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-sample target-speaker, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between input text and generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. Besides, the efficiency of the model is largely boosted through the design of the AI chip operator with 5x acceleration.
Comment: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
Published: 2023

24. AOSR-Net: All-in-One Sandstorm Removal Network

Author: Si, Yazhong, Zhang, Xulong, Yang, Fan, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve these issues, we introduce a novel image restoration model, named all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. This integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.
Comment: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
Published: 2023

25. DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Author: Qi, Zipeng, Zhang, Xulong, Cheng, Ning, Xiao, Jing, and Wang, Jianzong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features., Comment: Submitted to ICASSP 2024
Published: 2023

26. Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion is a method that allows for the transformation of speaking style while maintaining the integrity of linguistic information. Many researchers use deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. The samples generated by the Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity. But the DDPMs have high computational costs and their inference speed is slower than that of GANs. In order to make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve non-parallel many-to-many voice conversion (VC). We use large steps to achieve denoising, and also introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network. According to both objective and subjective evaluation experiments, DiffGAN-VC has been shown to achieve high voice quality on non-parallel data sets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves speaker similarity, naturalness and higher sound quality., Comment: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)
Published: 2023

27. Machine Unlearning Methodology base on Stochastic Teacher Network

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, Sun, Yifu, Zhang, Chuanyao, and Xiao, Jing
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: The rise of the phenomenon of the "right to be forgotten" has prompted research on machine unlearning, which grants data owners the right to actively withdraw data that has been used for model training, and requires the elimination of the contribution of that data to the model. A simple method to achieve this is to use the remaining data to retrain the model, but this is not acceptable for other data owners who continue to participate in training. Existing machine unlearning methods have been found to be ineffective in quickly removing knowledge from deep learning models. This paper proposes using a stochastic network as a teacher to expedite the mitigation of the influence caused by forgotten data on the model. We performed experiments on three datasets, and the findings demonstrate that our approach can efficiently mitigate the influence of target data on the model within a single epoch. This allows for one-time erasure and reconstruction of the model, and the reconstruction model achieves the same performance as the retrained model., Comment: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)
Published: 2023

28. Symbolic & Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music

Author: Zhu, Kexin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music Emotion Recognition (MER) involves the automatic identification of emotional elements within music tracks, and it has garnered significant attention due to its broad applicability in the field of Music Information Retrieval. It can also be used as the upstream task of many other human-related tasks such as emotional music generation and music recommendation. According to existing psychology research, music emotion is determined by multiple factors such as the Timbre, Velocity, and Structure of the music. Incorporating multiple factors in MER helps achieve more interpretable and finer-grained methods. However, most prior works were uni-domain and showed weak consistency between arousal modeling performance and valence modeling performance. Based on this background, we designed a multi-domain emotion modeling method for instrumental music that combines symbolic analysis and acoustic analysis. At the same time, because of the rarity of music data and the difficulty of labeling, our multi-domain approach can make full use of limited data. Our approach was implemented and assessed using the publicly available piano dataset EMOPIA, resulting in a notable improvement over our baseline model with a 2.4% increase in overall accuracy, establishing its state-of-the-art performance., Comment: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)
Published: 2023

29. Sparks of Large Audio Models: A Survey and Outlook

Author: Latif, Siddique, Shoukat, Moazzam, Shamshad, Fahad, Usama, Muhammad, Ren, Yi, Cuayáhuitl, Heriberto, Wang, Wenwu, Zhang, Xulong, Togneri, Roberto, Cambria, Erik, and Schuller, Björn W.
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, these Foundational Audio Models, like SeamlessM4T, have recently started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models with the intent to spark further discussion, thereby fostering innovation in the next generation of audio-processing systems. Furthermore, to cope with the rapid development in this area, we will consistently update the relevant repository with relevant recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models., Comment: Under review, Repo URL: https://github.com/EmulationAI/awesome-large-audio-models
Published: 2023

30. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Author: Deng, Yimin, Tang, Huaizhen, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion, as the style transfer task applied to speech, refers to converting one person's speech into a new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosody, pace, pause, etc. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of naturalness and similarity of converted speech., Comment: Accepted by the 31st ACM International Conference on Multimedia (MM2023)
Published: 2023
Full Text: View/download PDF

31. EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Author: Tang, Haobin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embedding. Mixed emotion synthesis is achieved by combining the noises predicted by diffusion model conditioned on different emotions during only one sampling process at the run-time. We further apply the Neutral and specific primary emotion mixed in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotion and intensity control., Comment: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)
Published: 2023

32. SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Author: Wang, Jianzong, Zhang, Xulong, Tang, Haobin, Sun, Aolan, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, there are always distortions existing in the predicted acoustic features, compared to those of the groundtruth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) to replace human-crafted acoustic features by introducing distortion prior to an auto-encoder pre-training process. The learned acoustic representation from the proposed framework is proved anti-distortion compared to the most commonly used mel-spectrogram through both objective and subjective evaluation., Comment: Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)
Published: 2023

33. Work in Progress: Empowering Vocational Education with Automation Technology and PLC Integration

Author: Song, Lizhi, Zhang, Xulong, Wei, Xixin, Fu, Mingshen, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Auer, Michael E., editor, Langmann, Reinhard, editor, May, Dominik, editor, and Roos, Kim, editor
Published: 2024
Full Text: View/download PDF

34. An Empirical Study of Attention Networks for Semantic Segmentation

Author: Guo, Hao, Si, Hongbiao, Jiang, Guilin, Zhang, Wei, Liu, Zhiyan, Zhu, Xuanyi, Zhang, Xulong, Liu, Yang, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Song, Xiangyu, editor, Feng, Ruyi, editor, Chen, Yunliang, editor, Li, Jianxin, editor, and Min, Geyong, editor
Published: 2024
Full Text: View/download PDF

35. A Hierarchy-Based Analysis Approach for Blended Learning: A Case Study with Chinese Students

Author: Ye, Yu, Zhang, Gongjin, Si, Hongbiao, Xu, Liang, Hu, Shenghua, Li, Yong, Zhang, Xulong, Hu, Kaiyu, Ye, Fangzhou, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Song, Xiangyu, editor, Feng, Ruyi, editor, Chen, Yunliang, editor, Li, Jianxin, editor, and Min, Geyong, editor
Published: 2024
Full Text: View/download PDF

36. Stock Volatility Prediction Based on Transformer Model Using Mixed-Frequency Data

Author: Liu, Wenting, Gui, Zhaozhong, Jiang, Guilin, Tang, Lihua, Zhou, Lichun, Leng, Wan, Zhang, Xulong, Liu, Yujiang, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Song, Xiangyu, editor, Feng, Ruyi, editor, Chen, Yunliang, editor, Li, Jianxin, editor, and Min, Geyong, editor
Published: 2024
Full Text: View/download PDF

37. Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model

Author: Zhou, Shanyi, Yan, Ning, Li, Zhijun, Geng, Mo, Zhang, Xulong, Si, Hongbiao, Tang, Lihua, Sun, Wenyuan, Zhang, Longda, Cao, Yi, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Song, Xiangyu, editor, Feng, Ruyi, editor, Chen, Yunliang, editor, Li, Jianxin, editor, and Min, Geyong, editor
Published: 2024
Full Text: View/download PDF

38. Comparing the Effectiveness of Betamethasone and Triamcinolone Acetonide in Multimodal Cocktail Intercostal Injection for Chest Pain After Harvesting Costal Cartilage: A Prospective, Double-Blind, Randomized Controlled Study

Author: Wang, Xin, Dong, Wenfang, Song, Zhen, Wang, Huan, You, Jianjun, Zheng, Ruobing, Xu, Yihao, Zhang, Xulong, Guo, Junsheng, Tian, Le, and Fan, Fei
Published: 2024
Full Text: View/download PDF

39. Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations

Author: Zhu, Kexin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Using deep learning methods to classify EEG signals can accurately identify people's emotions. However, existing studies have rarely considered the application of the information in another domain's representations to feature selection in the time-frequency domain. We propose a classification network of EEG signals based on the cross-domain feature fusion method, which makes the network more focused on the features most related to brain activities and thinking changes by using the multi-domain attention mechanism. In addition, we propose a two-step fusion method and apply these methods to the EEG emotion recognition network. Experimental results show that our proposed network, which combines multiple representations in the time-frequency domain and spatial domain, outperforms previous methods on public datasets and achieves state-of-the-art at present., Comment: Accepted by ICASSP 2023 - The 48th IEEE International Conference on Acoustics, Speech, & Signal Processing
Published: 2023

40. Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Author: Zhang, Xulong, Tang, Haobin, Wang, Jianzong, Cheng, Ning, Luo, Jian, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Because of predicting all the target tokens in parallel, the non-autoregressive models greatly improve the decoding efficiency of speech recognition compared with traditional autoregressive models. In this work, we present dynamic alignment Mask CTC, introducing two methods: (1) Aligned Cross Entropy (AXE), finding the monotonic alignment that minimizes the cross-entropy loss through dynamic programming, (2) Dynamic Rectification, creating new training samples by replacing some masks with model predicted tokens. The AXE ignores the absolute position alignment between prediction and ground truth sentence and focuses on tokens matching in relative order. The dynamic rectification method makes the model capable of simulating the non-mask but possible wrong tokens, even if they have high confidence. Our experiments on WSJ dataset demonstrated that not only AXE loss but also the rectification method could improve the WER performance of Mask CTC., Comment: Accepted by ICASSP 2023
Published: 2023

41. QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Author: Tang, Haobin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent expressive text to speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles such as intonation are neglected. In this paper, we propose QI-TTS which aims to better transfer and control intonation to further deliver the speaker's questioning intention while transferring emotion from reference speech. We propose a multi-style extractor to extract style embedding from two different levels. While the sentence level represents emotion, the final syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level. Experiments have validated the effectiveness of QI-TTS for improving intonation expressiveness in emotional speech synthesis., Comment: Accepted by ICASSP 2023
Published: 2023

42. Improving Music Genre Classification from Multi-Modal Properties of Music and Genre Correlations Perspective

Author: Ru, Ganghui, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound
Abstract: Music genre classification has been widely studied in past few years for its various applications in music information retrieval. Previous works tend to perform unsatisfactorily, since those methods only use audio content or jointly use audio content and lyrics content inefficiently. In addition, as genres normally co-occur in a music track, it is desirable to capture and model the genre correlations to improve the performance of multi-label music genre classification. To solve these issues, we present a novel multi-modal method leveraging audio-lyrics contrastive loss and two symmetric cross-modal attention, to align and fuse features from audio and lyrics. Furthermore, based on the nature of the multi-label classification, a genre correlations extraction module is presented to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art result on Music4All dataset., Comment: Accepted by ICASSP 2023
Published: 2023

43. Work in Progress: Empowering Vocational Education with Automation Technology and PLC Integration

Author: Song, Lizhi, primary, Zhang, Xulong, additional, Wei, Xixin, additional, and Fu, Mingshen, additional
Published: 2024
Full Text: View/download PDF

44. An Empirical Study of Attention Networks for Semantic Segmentation

Author: Guo, Hao, primary, Si, Hongbiao, additional, Jiang, Guilin, additional, Zhang, Wei, additional, Liu, Zhiyan, additional, Zhu, Xuanyi, additional, Zhang, Xulong, additional, and Liu, Yang, additional
Published: 2024
Full Text: View/download PDF

45. Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, Zhao, Mengyuan, Zhang, Zhiyong, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The recent emergence of joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR). The improvement largely lies in the modeling of linguistic information by decoder. The decoder joint-optimized with an acoustic encoder renders the language model from ground-truth sequences in an auto-regressive manner during training. However, the training corpus of the decoder is limited to the speech transcriptions, which is far less than the corpus needed to train an acceptable language model. This leads to poor robustness of decoder. To alleviate this problem, we propose linguistic-enhanced transformer, which introduces refined CTC information to decoder during training process, so that the decoder can be more robust. Our experiments on AISHELL-1 speech corpus show that the character error rate (CER) is relatively reduced by up to 7%. We also find that in joint CTC-Attention ASR model, decoder is more sensitive to linguistic information than acoustic information., Comment: Accepted by ECAISS2022, The Fourth International Workshop on Edge Computing and Artificial Intelligence based Sensor-Cloud System
Published: 2022

46. Improving Imbalanced Text Classification with Dynamic Curriculum Learning

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recent advances in pre-trained language models have improved the performance for text classification tasks. However, little attention is paid to the priority scheduling strategy on the samples during training. Humans acquire knowledge gradually from easy to complex concepts, and the difficulty of the same material can also vary significantly in different learning stages. Inspired by these insights, we proposed a novel self-paced dynamic curriculum learning (SPDCL) method for imbalanced text classification, which evaluates the sample difficulty by both linguistic character and model capacity. Meanwhile, rather than using static curriculum learning as in the existing research, our SPDCL can reorder and resample training data by difficulty criterion with an adaptive easy-to-hard pace. The extensive experiments on several classification tasks show the effectiveness of the SPDCL strategy, especially for the imbalanced dataset., Comment: Accepted by UEIoT2022, The 3rd International Workshop on Ubiquitous Electric Internet of Things
Published: 2022

47. MetaSpeech: Speech Effects Switch Along with Environment for Metaverse

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Metaverse expands the physical world to a new dimension, and the physical environment and Metaverse environment can be directly connected and entered. Voice is an indispensable communication medium in the real world and Metaverse. Fusion of the voice with environment effects is important for user immersion in Metaverse. In this paper, we proposed using the voice conversion based method for the conversion of target environment effect speech. The proposed method was named MetaSpeech, which introduces an environment effect module containing an effect extractor to extract the environment information and an effect encoder to encode the environment effect condition, in which gradient reversal layer was used for adversarial training to keep the speech content and speaker information while disentangling the environmental effects. From the experiment results on the public dataset of LJSpeech with four environment effects, the proposed model could complete the specific environment effect conversion and outperforms the baseline methods from the voice conversion task., Comment: Accepted by AI2OT2022, The Third International Workshop on Artificial Intelligence Applications in Internet of Things
Published: 2022

48. Semi-Supervised Learning Based on Reference Model for Low-resource TTS

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most previous neural text-to-speech (TTS) methods are mainly based on supervised learning methods, which means they depend on a large training dataset and hard to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in which labeled target data is limited, which can also resolve the problem of exposure bias in the previous auto-regressive models. Specifically, we pre-train the reference model based on Fastspeech2 with much source data, fine-tuned on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to guide the fine-tuned model's training further, achieve a regularization effect, and reduce the overfitting of the fine-tuned model during training on the limited target data. Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves the voice quality for test data to achieve naturalness and robustness in speech synthesis., Comment: Accepted by NMIC2022, The Fourth International Workshop on Network Meets Intelligent Computations
Published: 2022

49. Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, Zhu, Kexin, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recovering the masked speech frames is widely applied in speech representation learning. However, most of these models use random masking in the pre-training. In this work, we proposed two kinds of masking approaches: (1) speech-level masking, making the model to mask more speech segments than silence segments, (2) phoneme-level masking, forcing the model to mask the whole frames of the phoneme, instead of phoneme pieces. We pre-trained the model via these two approaches, and evaluated on two downstream tasks, phoneme classification and speaker recognition. The experiments demonstrated that the proposed masking approaches are beneficial to improve the performance of speech representation., Comment: Accepted by MSN2022, The 18th International Conference on Mobility, Sensing and Networking
Published: 2022

50. Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we proposed Adapitch, a multi-speaker TTS method that makes adaptation of the supervised module with untranscribed data. We design two self-supervised modules to train the text encoder and mel decoder separately with untranscribed data to enhance the representation of text and mel. To better handle the prosody information in a synthesized voice, a supervised TTS module is designed conditioned on content disentangling of pitch, text, and speaker. The training phase was separated into two parts, pretrained and fixed the text encoder and mel decoder with unsupervised mode, then the supervised mode on the disentanglement of TTS. Experiment results show that Adapitch achieved much better quality than baseline methods., Comment: Accepted by MSN2022, The 18th International Conference on Mobility, Sensing and Networking
Published: 2022
