Author: "Wang, Yujun" / Search Limiters: Full Text - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wang, Yujun"' showing total 801 results


1. Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models

Author: Poncelet, Jakob, Wang, Yujun, and Van hamme, Hugo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Continuous speech can be converted into a discrete sequence by deriving discrete units from the hidden features of self-supervised learned (SSL) speech models. Although SSL models are becoming larger and trained on more data, they are often sensitive to real-life distortions like additive noise or reverberation, which translates to a shift in discrete units. We propose a parameter-efficient approach to generate noise-robust discrete units from pre-trained SSL models by training a small encoder-decoder model, with or without adapters, to simultaneously denoise and discretise the hidden features of the SSL model. The model learns to generate a clean discrete sequence for a noisy utterance, conditioned on the SSL features. The proposed denoiser outperforms several pre-training methods on the tasks of noisy discretisation and noisy speech recognition, and can be finetuned to the target environment with a few recordings of unlabeled target data.
Comment: Accepted at SLT2024
Published: 2024

2. Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Author: Liu, Jizhong, Li, Gang, Zhang, Junbo, Dinkel, Heinrich, Wang, Yongqing, Yan, Zhiyong, Wang, Yujun, and Wang, Bin
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
Comment: Accepted by Interspeech 2024
Published: 2024

3. Bridging Language Gaps in Audio-Text Retrieval

Author: Yan, Zhiyong, Dinkel, Heinrich, Wang, Yongqing, Liu, Jizhong, Zhang, Junbo, Wang, Yujun, and Wang, Bin
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.
Comment: Interspeech 2024
Published: 2024

4. Scaling up masked audio encoder learning for general audio classification

Author: Dinkel, Heinrich, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, Wang, Yujun, and Wang, Bin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available at https://github.com/richermans/dasheng/.
Comment: Interspeech 2024
Published: 2024

5. Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Author: Jiang, Yuepeng, Li, Tao, Yang, Fengyu, Xie, Lei, Meng, Meng, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness.
Comment: 5 pages, 2 figures, accepted by Interspeech2024
Published: 2024

6. CED: Consistent ensemble distillation for audio tagging

Author: Dinkel, Heinrich, Wang, Yongqing, Yan, Zhiyong, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature, meaning that only the stored logits are used for the optimization of a student model, only requiring 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
Published: 2023

7. Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction

Author: Guo, Aoqi, Wu, Junnan, Gao, Peng, Zhu, Wenbo, Guo, Qinwen, Gao, Dazhi, and Wang, Yujun
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of the neural beamformer. To achieve this, we first use the UNet-TCN structure to model input features and improve the estimation accuracy of the speech pre-separation module by avoiding information loss caused by direct dimensionality reduction in other models. Furthermore, we introduce a multi-head cross-attention mechanism that enhances the neural beamformer's perception of spatial information by making full use of the spatial information received by the array. Experimental results demonstrate that our approach, which incorporates a more reasonable target mask estimation network and a spatial information-based cross-attention mechanism into the neural beamformer, effectively improves speech separation performance.
Published: 2023

8. Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Author: Lin, Jiuxin, Wang, Peng, Dinkel, Heinrich, Chen, Jun, Wu, Zhiyong, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Previously, Target Speaker Extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor, which leverages distance information for TSE to reliably extract speaker information without requiring previous speaker enrolment, called speaker embedding self-enrollment (SESE). Full- & sub-band modeling is introduced to enhance our NS-Extractor's adaptability towards environments with significant reverberation. Experimental results on several cross-datasets demonstrate the effectiveness of our improvements and the excellent performance of our proposed NS-Extractor in different application scenarios.
Comment: Proc. INTERSPEECH 2023, 2488-2492, doi: 10.21437/Interspeech.2023-218
Published: 2023

9. AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Author: Lin, Jiuxin, Cai, Xinyu, Dinkel, Heinrich, Chen, Jun, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, Wu, Zhiyong, Wang, Yujun, and Meng, Helen
Subjects: Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based attention dual-scale model that utilizes cross- and self-attention to fuse and model features from the audio and visual modalities. AV-SepFormer splits the audio feature into a number of chunks, equivalent to the length of the visual feature. Then self- and cross-attention are employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces the positional information between and within chunks and provides significant gains over the traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio feature is synchronized to the visual feature, which alleviates the harm caused by the inconsistency of audio and video sampling rates; by combining self- and cross-attention, feature fusion and speech extraction processes are unified within an attention paradigm. The experimental results show that AV-SepFormer significantly outperforms other existing methods.
Comment: Accepted by ICASSP2023
Published: 2023

10. Understanding temporally weakly supervised training: A case study for keyword spotting

Author: Dinkel, Heinrich, Zhuang, Weiji, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The currently most prominent algorithm to train keyword spotting (KWS) models with deep neural networks (DNNs) requires strong supervision, i.e., precise knowledge of the spoken keyword location in time. Thus, most KWS approaches treat the presence of redundant data, such as noise, within their training set as an obstacle. A common training paradigm to deal with data redundancies is to use temporally weakly supervised learning, which only requires providing labels on a coarse scale. This study explores the limits of DNN training using temporally weak labeling with applications in KWS. We train a simple end-to-end classifier on the common Google Speech Commands dataset with increased difficulty by randomly appending and adding noise to the training dataset. Our results indicate that temporally weak labeling can achieve comparable results to strongly supervised baselines while having a less stringent labeling requirement. In the presence of noise, weakly supervised models are capable of localizing and extracting target keywords without explicit supervision, leading to a performance increase compared to strongly supervised approaches.
Published: 2023

11. Streaming Audio Transformers for Online Audio Tagging

Author: Dinkel, Heinrich, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, Wang, Yujun, and Wang, Bin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at a delay of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at https://github.com/RicherMans/SAT.
Comment: Interspeech2024
Published: 2023

12. Effect of environmental factors on adsorption of ciprofloxacin from wastewater by microwave alkali modified fly ash

Author: Liu, Tonglinxi, Liu, Wen, Li, Xinyue, Wang, Hanyu, Lan, Yushan, Zhang, Shengmin, Wang, Yujun, and Liu, Huiqing
Published: 2024

13. Next-Generation Green Hydrogen: Progress and Perspective from Electricity, Catalyst to Electrolyte in Electrocatalytic Water Splitting

Author: Gao, Xueqing, Chen, Yutong, Wang, Yujun, Zhao, Luyao, Zhao, Xingyuan, Du, Juan, Wu, Haixia, and Chen, Aibing
Published: 2024

14. Microbiota-derived 3-phenylpropionic acid promotes myotube hypertrophy by Foxo3/NAD+ signaling pathway

Author: Li, Penglin, Feng, Xiaohua, Ma, Zewei, Yuan, Yexian, Jiang, Hongfeng, Xu, Guli, Zhu, Yunlong, Yang, Xue, Wang, Yujun, Zhu, Canjun, Wang, Songbo, Gao, Ping, Jiang, Qingyan, and Shu, Gang
Published: 2024

15. Continuous and low-carbon production of biomass flash graphene

Author: Zhu, Xiangdong, Lin, Litao, Pang, Mingyue, Jia, Chao, Xia, Longlong, Shi, Guosheng, Zhang, Shicheng, Lu, Yuanda, Sun, Liming, Yu, Fengbo, Gao, Jie, He, Zhelin, Wu, Xuan, Li, Aodi, Wang, Liang, Wang, Meiling, Cao, Kai, Fu, Weiguo, Chen, Huakui, Li, Gang, Zhang, Jiabao, Wang, Yujun, Yang, Yi, and Zhu, Yong-Guan
Published: 2024

16. Core–shell CoN@Co ultra-stable nanoparticles on biochar for contamination remediation in water and soil

Author: Yang, Qiang, Cui, Peixin, Liu, Cun, Fang, Guodong, Dang, Fei, Wang, Pengsheng, Wang, Shaobin, and Wang, Yujun
Published: 2024

17. Effects of straw returning on photochemical process and imidacloprid degradation in paddy water through a field experiment

Author: Li, Mabo, Zeng, Yu, Fu, Qinglong, Zhang, Mingyang, Chen, Ning, Wang, Yujun, Zhou, Dongmei, and Fang, Guodong
Published: 2024

18. Long-term biochar application influences phosphorus and associated iron and sulfur transformations in the rhizosphere

Author: Yuan, Jiahui, Chen, Hao, Chen, Guanglei, Pokharel, Prem, Chang, Scott X., Wang, Yujun, Wang, Dengjun, Yan, Xiaoyuan, Wang, Shenqiang, and Wang, Yu
Published: 2024

19. Exploring Representation Learning for Small-Footprint Keyword Spotting

Author: Cui, Fan, Guo, Liyong, Wang, Quandong, Gao, Peng, and Wang, Yujun
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we investigate representation learning for low-resource keyword spotting (KWS). The main challenges of KWS are limited labeled data and limited available device resources. To address those challenges, we explore representation learning for KWS by self-supervised contrastive learning and self-training with a pretrained model. First, local-global contrastive siamese networks (LGCSiam) are designed to learn similar utterance-level representations for similar audio samples via the proposed local-global contrastive loss without requiring ground-truth. Second, a self-supervised pretrained Wav2Vec 2.0 model is applied as a constraint module (WVC) to force the KWS model to learn frame-level acoustic representations. By the LGCSiam and WVC modules, the proposed small-footprint KWS model can be pretrained with unlabeled data. Experiments on speech commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy, especially in the case of training on a small labeled dataset.
Published: 2023

20. Relate auditory speech to EEG by shallow-deep attention-based network

Author: Cui, Fan, Guo, Liyong, He, Lang, Liu, Jiyao, Pei, ErCheng, Wang, Yujun, and Jiang, Dongmei
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing, Quantitative Biology - Neurons and Cognition
Abstract: Electroencephalography (EEG) plays a vital role in detecting how the brain responds to different stimuli. In this paper, we propose a novel Shallow-Deep Attention-based Network (SDANet) to classify the correct auditory stimulus evoking the EEG signal. It adopts the Attention-based Correlation Module (ACM) to discover the connection between auditory speech and EEG from a global aspect, and the Shallow-Deep Similarity Classification Module (SDSCM) to decide the classification result via the embeddings learned from the shallow and deep layers. Moreover, various training strategies and data augmentation are used to boost the model robustness. Experiments are conducted on the dataset provided by Auditory EEG challenge (ICASSP Signal Processing Grand Challenge 2023). Results show that the proposed model has a significant gain over the baseline on the match-mismatch track.
Published: 2023

21. Improving Weakly Supervised Sound Event Detection with Causal Intervention

Author: Xin, Yifei, Yang, Dongchao, Cui, Fan, Wang, Yujun, and Zou, Yuexian
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision. To tackle this issue, we first establish a structural causal model (SCM) to reveal that the context is the main cause of co-occurrence confounders that mislead the model to learn spurious correlations between frames and clip-level labels. Based on the causal analysis, we propose a causal intervention (CI) method for WSSED to remove the negative impact of co-occurrence confounders by iteratively accumulating every possible context of each class and then re-projecting the contexts to the frame-level features for making the event boundary clearer. Experiments show that our method effectively improves the performance on multiple datasets and can generalize to various baseline models.
Comment: Accepted by ICASSP2023
Published: 2023

22. Unified Keyword Spotting and Audio Tagging on Mobile Devices with Transformers

Author: Dinkel, Heinrich, Wang, Yongqing, Yan, Zhiyong, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting (KWS) is a core human-machine-interaction front-end task for most modern intelligent assistants. Recently, a unified (UniKW-AT) framework has been proposed that adds additional capabilities in the form of audio tagging (AT) to a KWS model. However, previous work did not consider the real-world deployment of a UniKW-AT model, where factors such as model size and inference speed are more important than performance alone. This work introduces three mobile-device deployable models named Unified Transformers (UiT). Our best model achieves an mAP of 34.09 on Audioset, and an accuracy of 97.76 on the public Google Speech Commands V1 dataset. Further, we benchmark our proposed approaches on four mobile platforms, revealing that the proposed UiT models can achieve a speedup of 2-6 times against a competitive MobileNetV2.
Comment: ICASSP 2023
Published: 2023

23. Improve Bilingual TTS Using Dynamic Language and Phonology Embedding

Author: Yang, Fengyu, Luan, Jian, and Wang, Yujun
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In most cases, bilingual TTS needs to handle three types of input scripts: first language only, second language only, and second language embedded in the first language. In the latter two situations, the pronunciation and intonation of the second language are usually quite different due to the influence of the first language. Therefore, it is a big challenge to accurately model the pronunciation and intonation of the second language in different contexts without mutual interference. This paper builds a Mandarin-English TTS system to acquire more standard spoken English speech from a monolingual Chinese speaker. We introduce phonology embedding to capture the English differences between different phonology. Embedding mask is applied to language embedding for distinguishing information between different languages and to phonology embedding for focusing on English expression. We specially design an embedding strength modulator to capture the dynamic strength of language and phonology. Experiments show that our approach can produce significantly more natural and standard spoken English speech of the monolingual Chinese speaker. From analysis, we find that suitable phonology control contributes to better performance in different scenarios.
Comment: Submitted to ICASSP2023
Published: 2022

24. An empirical study of weakly supervised audio tagging embeddings for general audio representations

Author: Dinkel, Heinrich, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning methods (BYOL-A) as feature extractors. Fourteen downstream tasks are used for evaluation ranging from music instrument classification to language classification. Our results indicate that AT pre-trained models are an excellent transfer learning choice for music, event, and emotion recognition tasks. Further, finetuning AT models can also benefit speech-related tasks such as keyword spotting and intent classification.
Comment: Odyssey 2022
Published: 2022

25. UniKW-AT: Unified Keyword Spotting and Audio Tagging

Author: Dinkel, Heinrich, Wang, Yongqing, Yan, Zhiyong, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training both KWS and AT. UniKW-AT enhances the noise-robustness for KWS, while also being able to predict specific sound events and enabling conditional wake-ups on sound events. Our approach extends the AT pipeline with additional labels describing the presence of a keyword. Experiments are conducted on the Google Speech Commands V1 (GSCV1) and the balanced Audioset (AS) datasets. The proposed MobileNetV2 model achieves an accuracy of 97.53% on the GSCV1 dataset and an mAP of 33.4 on the AS evaluation set. Further, we show that significant noise-robustness gains can be observed on a real-world KWS dataset, greatly outperforming standard KWS approaches. Our study shows that KWS and AT can be merged into a single framework without significant performance degradation.
Comment: Accepted in Interspeech2022
Published: 2022

26. Pseudo strong labels for large scale weakly supervised audio tagging

Author: Dinkel, Heinrich, Yan, Zhiyong, Wang, Yongqing, Zhang, Junbo, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large-scale audio tagging datasets inevitably contain imperfect labels, such as clip-wise annotated (temporally weak) tags with no exact on- and offsets, due to a high manual labeling cost. This work proposes pseudo strong labels (PSL), a simple label augmentation framework that enhances the supervision quality for large-scale weakly supervised audio tagging. A machine annotator is first trained on a large weakly supervised dataset, which then provides finer supervision for a student model. Using PSL we achieve an mAP of 35.95 on the balanced train subset of Audioset using a MobileNetV2 back-end, significantly outperforming approaches without PSL. An analysis is provided which reveals that PSL mitigates missing labels. Lastly, we show that models trained with PSL also generalize better to the Freesound datasets (FSD) than their weakly trained counterparts.
Comment: Accepted by ICASSP 2022
Published: 2022

27. Learning Decoupling Features Through Orthogonality Regularization

Author: Wang, Li, Gu, Rongzhi, Zhuang, Weiji, Gao, Peng, Wang, Yujun, and Zou, Yuexian
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that the state-of-the-art KWS and SV models are trained independently using different datasets since they expect to learn distinctive acoustic features. However, humans can distinguish language content and the speaker identity simultaneously. Motivated by this, we believe it is important to explore a method that can effectively extract common features while decoupling task-specific features. Bearing this in mind, a two-branch deep network (KWS branch and SV branch) with the same network structure is developed and a novel decoupling feature learning method is proposed to push up the performance of KWS and SV simultaneously where speaker-invariant keyword representations and keyword-invariant speaker representations are expected respectively. Experiments are conducted on Google Speech Commands Dataset (GSCD). The results demonstrate that the orthogonality regularization helps the network to achieve SOTA EER of 1.31% and 1.87% on KWS and SV, respectively.
Comment: Accepted at ICASSP 2022
Published: 2022

28. Pathway dissection for inter-provincial transfer of pollutants and offsetting mechanisms across China

Author: Zhou, Baiqin, Li, Huiping, Zhao, Yuantian, Wang, Fangjun, Yang, Ruichun, Huang, Hui, Wang, Yujun, Fu, Shengnan, Hu, Mengxian, Lu, Zhiheng, and Pang, Weihai
Published: 2024

29. Heterogeneous and interactive effects of payments for ecosystem services on household income across giant panda nature reserves

Author: Zhang, Youqi, Wang, Yujun, Yang, Hongbo, Hull, Vanessa, Zhang, Jindong, Wang, Fang, Zhao, Zhiqiang, and Liu, Jianguo
Published: 2024

30. Seamless Prediction in China: A Review

Author: Ren, Hong-Li, Bao, Qing, Zhou, Chenguang, Wu, Jie, Gao, Li, Wang, Lin, Ma, Jieru, Tang, Yao, Liu, Yangke, Wang, Yujun, and Zhao, Zuosen
Published: 2023

31. Detect what you want: Target Sound Detection

Author: Yang, Dongchao, Wang, Helin, Zou, Yuexian, Cui, Fan, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Human beings can perceive a target sound type from a multi-source mixture signal through selective auditory attention; however, such functionality has hardly been explored in machine hearing. This paper addresses the target sound detection (TSD) task, which aims to detect the target sound signal from a mixture audio when a target sound's reference audio is given. We present a novel target sound detection network (TSDNet) which consists of two main parts: A conditional network which aims at generating a sound-discriminative conditional embedding vector representing the target sound, and a detection network which takes both the mixture audio and the conditional embedding vector as inputs and produces the detection result of the target sound. These two networks can be jointly optimized with a multi-task learning approach to further improve the performance. In addition, we study both strong-supervised and weakly-supervised strategies to train TSDNet and propose a data augmentation method by mixing two samples. To facilitate this research, we build a target sound detection dataset (i.e., URBAN-TSD) based on URBAN-SED and UrbanSound8K datasets, and experimental results indicate our method could get the segment-based F scores of 76.3% and 56.8% on the strongly-labelled and weakly-labelled data respectively.
Comment: Submitted to DCASE workshop2022
Published: 2021

32. Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

Author: Yang, Fengyu, Luan, Jian, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Learning an emotion embedding from reference audio is a straightforward approach for multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding and how to inject it into the TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint to help the VAE extract an emotion embedding with better cluster cohesion. In addition, the obtained emotion embedding is used as a query to aggregate latent representations of all encoder layers via attention. Moreover, the queries from the encoder layers themselves are also helpful. Experiments prove the proposed methods can enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech., Comment: accepted by ICASSP 2022
Published: 2021

33. PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control

Author: He, Yunchao, Luan, Jian, and Wang, Yujun
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Sequence expansion between encoder and decoder is a critical challenge in sequence-to-sequence TTS. Attention-based methods achieve great naturalness but suffer from stability issues like missing and repeating phonemes, not to mention the lack of accurate duration control. Duration-informed methods, on the contrary, can easily adjust phoneme duration but show obvious degradation in speech naturalness. This paper proposes PAMA-TTS to address the problem. It takes advantage of both flexible attention and explicit duration models. Based on the monotonic attention mechanism, PAMA-TTS also leverages token duration and the relative position of a frame, especially countdown information, i.e. in how many future frames the present phoneme will end. These help the attention move forward along the token sequence under soft but reliable control. Experimental results prove that PAMA-TTS achieves the highest naturalness, while having on-par or even better duration controllability than the duration-informed model., Comment: Accepted by ICASSP 2022. 5 pages, 4 figures, 3 tables. Audio samples are available at: https://pama-tts.github.io/
Published: 2021

34. Biodegradation of polybutylene succinate by an extracellular esterase from Pseudomonas mendocina

Author: Hu, Ting, Wang, Yujun, Ma, Li, Wang, Zhanyong, and Tong, Haibin
Published: 2024
Full Text: View/download PDF

35. Influence of soil pH and organic carbon content on the bioaccessibility of lead and copper in four spiked soils

Author: Cui, Jiaqi, Li, Hongbo, Shi, Yangxiaoxiao, Zhang, Feng, Hong, Zhineng, Fang, Di, Jiang, Jun, Wang, Yujun, and Xu, Renkou
Published: 2024
Full Text: View/download PDF

36. Study on the removal of Cd from dewatered sludge by combined extraction agents: Peel extract and compound chemical extraction agent

Author: Li, Hanyu, Jiao, Shuai, Tian, Lili, Wang, Yujun, and Li, Fei
Published: 2024
Full Text: View/download PDF

37. Large-scale production of cefazolin in a microreactor with a low impurity content and high yield

Author: Pan, Yongqi, Zhao, Shenyuan, Wang, Lijie, Zhang, Libin, Yang, Mengde, Wei, Baojun, Hu, Guogang, Ullah, Shafqat, Wang, Yujun, and Luo, Guangsheng
Published: 2024
Full Text: View/download PDF

38. A Separable Temporal Convolution Neural Network with Attention for Small-Footprint Keyword Spotting

Author: Hu, Shenghua, Wang, Jing, Wang, Yujun, Yang, Lidong, and Yang, Wenjing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still maintain a large number of parameters in order to ensure good performance. To solve this problem, this paper proposes a separable temporal convolution neural network with attention that has a small number of parameters. By combining temporal convolution with an attention mechanism, a model with a small number of parameters (32.2K) is implemented while maintaining high performance. The proposed model achieves 95.7% accuracy on the Google Speech Commands dataset, which is close to the performance of Res15 (239K), the state-of-the-art model in KWS at present., Comment: arXiv admin note: text overlap with arXiv:2108.12146
Published: 2021

39. Separable Temporal Convolution plus Temporally Pooled Attention for Lightweight High-performance Keyword Spotting

Author: Hu, Shenghua, Wang, Jing, Wang, Yujun, and Yang, Wenjing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still maintain a large number of parameters in order to ensure good performance. In this paper, we propose a temporally pooled attention module which can capture global features better than AveragePool. In addition, we design a separable temporal convolution network which leverages depthwise separable and temporal convolution to reduce the number of parameters and calculations. Finally, taking advantage of separable temporal convolution and temporally pooled attention, an efficient neural network (ST-AttNet) is designed for KWS systems. We evaluate the models on the publicly available Google Speech Commands dataset V1. The number of parameters of the proposed model (48K) is 1/6 of that of the state-of-the-art TC-ResNet14-1.5 model (305K). The proposed model achieves 96.6% accuracy, which is comparable to the TC-ResNet14-1.5 model (96.6%).
Published: 2021

40. Multi-channel Speech Enhancement with 2-D Convolutional Time-frequency Domain Features and a Pre-trained Acoustic Model

Author: Wang, Quandong, Wu, Junnan, Yan, Zhao, Qian, Sichong, Guo, Liyong, Fan, Lichun, Zhuang, Weiji, Gao, Peng, and Wang, Yujun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: We propose a multi-channel speech enhancement approach with a novel two-stage feature fusion method and a pre-trained acoustic model in a multi-task learning paradigm. In the first fusion stage, the time-domain and frequency-domain features are extracted separately. In the time domain, the multi-channel convolution sum (MCS) and the inter-channel convolution differences (ICDs) features are computed and then integrated with the first 2-D convolutional layer, while in the frequency domain, the log-power spectra (LPS) features from both original channels and super-directive beamforming outputs are combined with a second 2-D convolutional layer. To fully integrate the rich information of multi-channel speech, i.e. time-frequency domain features and the array geometry, we apply a third 2-D convolutional layer in the second fusion stage to obtain the final convolutional features. Furthermore, we propose to use a fixed clean acoustic model trained with the end-to-end lattice-free maximum mutual information criterion to enforce the enhanced output to have the same distribution as the clean waveform to alleviate the over-estimation problem of the enhancement task and constrain distortion. On the Task1 development dataset of ConferencingSpeech 2021 challenge, a PESQ improvement of 0.24 and 0.19 is attained compared to the official baseline and a recently proposed multi-channel separation method., Comment: 7 pages, 3 figures, accepted to APSIPA 2021, revised
Published: 2021

41. Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Author: Wu, Qinghua, Shen, Quanbo, Luan, Jian, and Wang, YuJun
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In multi-speaker speech synthesis, data from a number of speakers usually tend to have great diversity because the speakers may differ largely in age, speaking style, emotion, and so on. It is important but challenging to improve the modeling capabilities for multi-speaker speech synthesis. To address the issue, this paper proposes a high-capability speech synthesis system, called Msdtron, in which 1) a representation of the harmonic structure of speech, called the excitation spectrogram, is designed to directly guide the learning of harmonics in the mel-spectrogram, and 2) a conditional gated LSTM (CGLSTM) is proposed to control the flow of text content information through the network by re-weighting the gates of the LSTM using speaker information. The experiments show a significant reduction in the reconstruction error of the mel-spectrogram in the training of the multi-speaker model, and a great improvement is observed in the subjective evaluation of the speaker-adapted model., Comment: Accepted by ICASSP 2022
Published: 2021

42. Factors driving antimony accumulation in soil-pakchoi and wheat agroecosystems: Insights and predictive models

Author: Wu, Tongliang, Zhang, Naichi, Liu, Cun, Ding, Changfeng, Zhang, Peng, Hu, Sainan, Huang, Yihang, Ge, Zixuan, Cui, Peixin, and Wang, Yujun
Published: 2024
Full Text: View/download PDF

43. Mesoporous silica encapsulated core-shell NiRh@NiO nanocatalyst for performance-enhanced ethanol steam reforming

Author: Xue, Qiangqiang, Li, Zhengwen, Yan, Binhang, Ullah, Shafqat, Wang, Yujun, and Luo, Guangsheng
Published: 2024
Full Text: View/download PDF

44. Determining soil conservation strategies: Ecological risk thresholds of arsenic and the influence of soil properties

Author: Huang, Yihang, Zhang, Naichi, Ge, Zixuan, Lv, Chen, Zhu, Linfang, Ding, Changfeng, Liu, Cun, Peng, Peiqin, Wu, Tongliang, and Wang, Yujun
Published: 2024
Full Text: View/download PDF

45. Spontaneous regeneration of active sites against catalyst deactivation

Author: Feng, Kai, Zhang, Jiajun, Li, Zhengwen, Liu, Xiaozhi, Pan, Yue, Wu, Zhiyi, Tian, Jiaming, Chen, Yuxin, Zhang, Chengcheng, Xue, Qiangqiang, He, Le, Zhang, Xiaohong, Wang, Yujun, Yang, Bin, Su, Dong, Luo, Kai Hong, and Yan, Binhang
Published: 2024
Full Text: View/download PDF

46. Cavitation erosion behavior of HVAF-sprayed Cu-based glassy composite coatings in NaCl solution

Author: Wang, Yujun, Wu, Yuping, Hong, Sheng, Cheng, Jiangbo, and Zhu, Shuaishuai
Published: 2024
Full Text: View/download PDF

47. Underlying reasons and factors associated with changes in earthworm activities in response to biochar amendment: a review

Author: Cui, Jiaqi, Jiang, Jun, Chang, E., Zhang, Feng, Guo, Lingyu, Fang, Di, Xu, Renkou, and Wang, Yujun
Published: 2023
Full Text: View/download PDF

48. Analysis of the dynamic changes in gut microbiota in patients with different severity in sepsis

Author: Liu, Yanli, Guo, Yanan, Hu, Su, Wang, Yujun, Zhang, Lijuan, Yu, Li, and Geng, Feng
Published: 2023
Full Text: View/download PDF

49. Bibliometric analysis of biochar research in 2021: a critical review for development, hotspots and trend directions

Author: Wu, Ping, Singh, Bhupinder Pal, Wang, Hailong, Jia, Zhifen, Wang, Yujun, and Chen, Wenfu
Published: 2023
Full Text: View/download PDF

50. Active and stable alcohol dehydrogenase-assembled hydrogels via synergistic bridging of triazoles and metal ions

Author: Chen, Qiang, Qu, Ge, Li, Xu, Feng, Mingjian, Yang, Fan, Li, Yanjie, Li, Jincheng, Tong, Feifei, Song, Shiyi, Wang, Yujun, Sun, Zhoutong, and Luo, Guangsheng
Published: 2023
Full Text: View/download PDF
