Author: "Du, Jun" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Du, Jun"' showing total 8,311 results

1. DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Author: Niu, Shu-Tong, Du, Jun, Wang, Ruo-Yu, Yang, Gao-Bin, Gao, Tian, Pan, Jia, and Hu, Yu
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end speech recognition, combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries during decoding. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge. We also perform the evaluation on the open LibriCSS dataset, achieving a new state-of-the-art performance on single-channel speech recognition.
Published: 2024

2. Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Author: Weng, Yuzhe, Wang, Haotian, Gao, Tian, Li, Kewei, Niu, Shutong, and Du, Jun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In multimodal sentiment analysis, collecting text data is often more challenging than video or audio due to higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, our study has developed a robust model that effectively integrates multimodal sentiment information, even in the absence of text modality. Specifically, we have developed a Double-Flow Self-Distillation Framework, including Unified Modality Cross-Attention (UMCA) and Modality Imagination Autoencoder (MIA), which excels at processing both scenarios with complete modalities and those with missing text modality. In detail, when the text modality is missing, our framework uses the LLM-based model to simulate the text representation from the audio modality, while the MIA module supplements information from the other two modalities to make the simulated text representation similar to the real text representation. To further align the simulated and real representations, and to enable the model to capture the continuous nature of sample orders in sentiment valence regression tasks, we have also introduced the Rank-N Contrast (RNC) loss function. When tested on CMU-MOSEI, our model achieved outstanding performance on MAE and significantly outperformed other models when the text modality is missing. The code is available at: https://github.com/WarmCongee/SDUMC
Published: 2024

3. DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Author: Cheng, Hanbo, Lin, Limin, Liu, Chenyu, Xia, Pengcheng, Hu, Pengfei, Ma, Jiefeng, Du, Jun, and Pan, Jia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.
Published: 2024

4. Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization

Author: He, Mao-Kui, Du, Jun, Niu, Shu-Tong, Liu, Qing-Feng, and Lee, Chin-Hui
Subjects: Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, utilizing a series of binary classification output layers to simultaneously identify the activities of all speakers. This end-to-end framework is meticulously designed to effectively handle situations of overlapping speech, providing accurate discrimination between speech and non-speech segments through the utilization of multi-modal information. Next, we employ a quality-aware audio-visual fusion structure to address signal quality issues for both audio degradations, such as noise, reverberation and other distortions, and video degradations, such as occlusions, off-screen speakers, or unreliable detection. Finally, a cross attention mechanism applied to multi-speaker embedding empowers the network to handle scenarios with varying numbers of speakers. Our experimental results, obtained from various data sets, demonstrate the robustness of our proposed techniques in diverse acoustic environments. Even in scenarios with severely degraded video quality, our system attains performance levels comparable to the best available audio-visual systems.
Published: 2024

5. The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge

Author: Jiang, Ya, Lan, Hongbo, Du, Jun, Wang, Qing, and Niu, Shutong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In a two-person conversation where one participant wears smart glasses, transcribing and displaying the speaker's content in real time is an intriguing application, providing a priori information for subsequent tasks such as translation and comprehension. However, multi-modal data captured from smart glasses is scarce. We therefore propose using simulated data with multiple overlap rates and a one-to-one matching training strategy to narrow the gap between real and simulated data during model training. In addition, incorporating inertial measurement unit (IMU) data into the model helps the audio modality achieve better real-time speech recognition performance.
Published: 2024

6. See then Tell: Enhancing Key Information Extraction with Vision Grounding

Author: Liu, Shuhang, Zhang, Zhenrong, Hu, Pengfei, Ma, Jiefeng, Du, Jun, Wang, Qing, Zhang, Jianshu, and Liu, Chenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token. Positioned at the outset of the answer text, the token allows the model to first see--observing the regions of the image related to the input question--and then tell--providing articulated textual responses. To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets. Leveraging the advanced text processing prowess of GPT-4, we develop the TVG (TableQA with Vision Grounding) dataset, which not only provides text-based Question Answering (QA) pairs but also incorporates precise vision grounding for these pairs. Our approach demonstrates substantial advancements in KIE performance, achieving state-of-the-art results on publicly available datasets such as CORD, SROIE, and DocVQA. The code will also be made publicly available.
Published: 2024

7. Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

Author: Wang, Ruoyu, Niu, Shutong, Yang, Gaobin, Du, Jun, Qian, Shuangqing, Gao, Tian, and Pan, Jia
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system to enhance single-channel neural speaker diarization systems and recognition performance by utilizing spatial cues from multi-channel speech to provide more accurate initialization for each stage of neural speaker diarization (NSD) decoding: (1) Overlap detection and continuous speech separation (CSS) on multi-channel speech are used to obtain cleaner single-speaker speech segments for clustering, followed by the first NSD decoding pass. (2) The results from the first pass initialize a complex Angular Central Gaussian Mixture Model (cACGMM) to estimate speaker-wise masks on multi-channel speech, and through Overlap-add and Mask-to-VAD, achieve initialization with lower speaker error (SpkErr), followed by the second NSD decoding pass. (3) The second decoding results are used for guided source separation (GSS), recognizing and filtering short segments containing less than one word to obtain cleaner speech segments, followed by re-clustering and the final NSD decoding pass. We present the progressively explored evaluation results from the CHiME-8 NOTSOFAR-1 (Natural Office Talkers in Settings Of Far-field Audio Recordings) challenge, demonstrating the effectiveness of our system and its contribution to improving recognition performance. Our final system achieved first place in the challenge.
Comment: 5 pages, submitted to ICASSP 2025
Published: 2024

8. UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition

Author: Zhang, Zhenrong, Liu, Shuhang, Hu, Pengfei, Ma, Jiefeng, Du, Jun, Zhang, Jianshu, and Hu, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively comprehend the textual semantics within tables, particularly for descriptive textual cells. In this paper, we introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure. We further enhance our framework with the Vision Guider, which directs the model's focus towards pertinent areas, thereby boosting prediction accuracy. Additionally, we introduce the Language Guider to refine the model's capability to understand textual semantics in table images. Evaluated on prominent table structure datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves a new state-of-the-art performance, demonstrating the efficacy of our approach. The code will also be made publicly available.
Published: 2024

9. DocMamba: Efficient Document Pre-training with State Space Model

Author: Hu, Pengfei, Zhang, Zhenrong, Ma, Jiefeng, Liu, Shuhang, Du, Jun, and Zhang, Jianshu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on the HRDoc confirm DocMamba's potential for length extrapolation. The code will be available online.
Published: 2024

10. Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge

Author: Xue, Hongfei, Gong, Rong, Shao, Mingchen, Xu, Xin, Wang, Lezhi, Xie, Lei, Bu, Hui, Zhou, Jiaming, Qin, Yong, Du, Jun, Li, Ming, Zhang, Binbin, and Jia, Bin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The StutteringSpeech Challenge focuses on advancing speech technologies for people who stutter, specifically targeting Stuttering Event Detection (SED) and Automatic Speech Recognition (ASR) in Mandarin. The challenge comprises three tracks: (1) SED, which aims to develop systems for detection of stuttering events; (2) ASR, which focuses on creating robust systems for recognizing stuttered speech; and (3) Research track for innovative approaches utilizing the provided dataset. We utilize an open-source Mandarin stuttering dataset, AS-70, which has been split into new training and test sets for the challenge. This paper presents the dataset, details the challenge tracks, and analyzes the performance of the top systems, highlighting improvements in detection accuracy and reductions in recognition error rates. Our findings underscore the potential of specialized models and augmentation strategies in developing stuttered speech technologies.
Comment: 8 pages, 2 figures, accepted by SLT 2024
Published: 2024

11. The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

Author: Niu, Shutong, Wang, Ruoyu, Du, Jun, Yang, Gaobin, Tu, Yanhui, Wu, Siyuan, Qian, Shuangqing, Wu, Huaxin, Xu, Haitao, Zhang, Xueyang, Zhong, Guolong, Yu, Xindi, Chen, Jieru, Wang, Mengzhi, Cai, Di, Gao, Tian, Wan, Genshun, Ma, Feng, Pan, Jia, and Gao, Jianqing
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects: For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. Additionally, we also integrated traditional guided source separation (GSS) for multi-channel track to provide complementary information for the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively.
Published: 2024

12. Topological GCN for Improving Detection of Hip Landmarks from B-Mode Ultrasound Images

Author: Huang, Tianxiang, Shi, Jing, Jin, Ge, Li, Juncheng, Wang, Jun, Du, Jun, and Shi, Jun
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: B-mode ultrasound based computer-aided diagnosis (CAD) has demonstrated its effectiveness for diagnosis of Developmental Dysplasia of the Hip (DDH) in infants. However, due to the effect of speckle noise in ultrasound images, it is still a challenging task to accurately detect hip landmarks. In this work, we propose a novel hip landmark detection model by integrating the Topological GCN (TGCN) with an Improved Conformer (TGCN-ICF) into a unified framework to improve detection performance. The TGCN-ICF includes two subnetworks: an Improved Conformer (ICF) subnetwork to generate heatmaps and a TGCN subnetwork to additionally refine landmark detection. This TGCN can effectively improve detection accuracy with the guidance of class labels. Moreover, a Mutual Modulation Fusion (MMF) module is developed for deeply exchanging and fusing the features extracted from the U-Net and Transformer branches in ICF. The experimental results on the real DDH dataset demonstrate that the proposed TGCN-ICF outperforms all the compared algorithms.
Published: 2024

13. Constrained Optimization with Compressed Gradients: A Dynamical Systems Perspective

Author: Xia, Zhaoyue, Du, Jun, Jiang, Chunxiao, Poor, H. Vincent, and Ren, Yong
Subjects: Mathematics - Optimization and Control, Electrical Engineering and Systems Science - Systems and Control
Abstract: Gradient compression is of growing interest for solving constrained optimization problems including compressed sensing, noisy recovery and matrix completion under limited communication resources and storage costs. Convergence analysis of these methods from the dynamical systems viewpoint has attracted considerable attention because it provides a geometric demonstration towards the shadowing trajectory of a numerical scheme. In this work, we establish a tight connection between a continuous-time nonsmooth dynamical system called a perturbed sweeping process (PSP) and a projected scheme with compressed gradients. Theoretical results are obtained by analyzing the asymptotic pseudo trajectory of a PSP. We show that under mild assumptions a projected scheme converges to an internally chain transitive invariant set of the corresponding PSP. Furthermore, given the existence of a Lyapunov function $V$ with respect to a set $\Lambda$, convergence to $\Lambda$ can be established if $V(\Lambda)$ has an empty interior. Based on these theoretical results, we are able to provide a useful framework for convergence analysis of projected methods with compressed gradients. Moreover, we propose a provably convergent distributed compressed gradient descent algorithm for distributed nonconvex optimization. Finally, numerical simulations are conducted to confirm the validity of theoretical analysis and the effectiveness of the proposed algorithm.
Published: 2024

14. NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition

Author: Liu, Chenyu, Pan, Jia, Hu, Jinshui, Yin, Baocai, Yin, Bing, Chen, Mingjun, Liu, Cong, Du, Jun, and Liu, Qingfeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recently, Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding. Current methods typically approach HMER as an image-to-sequence generation task within an autoregressive (AR) encoder-decoder framework. However, these approaches suffer from several drawbacks: 1) a lack of overall language context, limiting information utilization beyond the current decoding step; 2) error accumulation during AR decoding; and 3) slow decoding speed. To tackle these problems, this paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER. NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD). Initially, the VAT tokenizes visible symbols and local relations at a coarse level. Subsequently, the PGD refines all tokens and establishes connectivities in parallel, leveraging comprehensive visual and linguistic contexts. Experiments on CROHME 2014/2016/2019 and HME100K datasets demonstrate that NAMER not only outperforms the current state-of-the-art (SOTA) methods on ExpRate by 1.93%/2.35%/1.49%/0.62%, but also achieves significant speedups of 13.7x and 6.7x in decoding time and overall FPS, proving the effectiveness and efficiency of NAMER.
Comment: Accepted by ECCV 2024
Published: 2024

15. Out-of-Plane Polarization from Spin Reflection Induces Field-Free Spin-Orbit Torque Switching in Structures with Canted NiO Interfacial Moments

Author: Zhang, Zhe, Li, Zhuoyi, Chen, Yuzhe, Zhu, Fangyuan, Yan, Yu, Li, Yao, He, Liang, Du, Jun, Zhang, Rong, Wu, Jing, Lu, Xianyang, and Xu, Yongbing
Subjects: Physics - Applied Physics
Abstract: Realizing deterministic current-induced spin-orbit torque (SOT) magnetization switching, especially in systems exhibiting perpendicular magnetic anisotropy (PMA), typically requires the application of a collinear in-plane field, posing a challenging problem. In this study, we successfully achieve field-free SOT switching in the CoFeB/MgO system. In a Ta/CoFeB/MgO/NiO/Ta structure, spin reflection at the NiO interface, characterized by noncollinear spin structures with canted magnetization, generates a spin current with an out-of-plane spin polarization $\sigma_z$. We confirm the contribution of $\sigma_z$ to the field-free SOT switching through measurements of the shift effect in the out-of-plane magnetization hysteresis loops under different currents. The incorporation of NiO as an antiferromagnetic insulator mitigates the current shunting effect and ensures excellent thermal stability of the device. The sample with 0.8 nm MgO and 2 nm NiO demonstrates an impressive optimal switching ratio approaching 100% without an in-plane field. This breakthrough in the CoFeB/MgO system promises significant applications in spintronics, advancing us closer to realizing innovative technologies.
Published: 2024

16. Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

Author: Jiang, Ya, Wang, Qing, Du, Jun, Hu, Maocheng, Hu, Pengfei, Liu, Zeyan, Cheng, Shi, Nian, Zhaoxu, Dong, Yuxuan, Cai, Mingqi, Fang, Xin, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performance. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first by effectively integrating the proposed techniques into a model ensemble.
Comment: Accepted by ICME 2024
Published: 2024

17. Quantum Compiling with Reinforcement Learning on a Superconducting Processor

Author: Wang, Z. T., Chen, Qiuhao, Du, Yuxuan, Yang, Z. H., Cai, Xiaoxia, Huang, Kaixuan, Zhang, Jingning, Xu, Kai, Du, Jun, Li, Yinan, Jiao, Yuling, Wu, Xingyao, Liu, Wu, Lu, Xiliang, Xu, Huikai, Jin, Yirong, Wang, Ruixia, Yu, Haifeng, and Zhao, S. P.
Subjects: Quantum Physics, Computer Science - Machine Learning
Abstract: To effectively implement quantum algorithms on noisy intermediate-scale quantum (NISQ) processors is a central task in modern quantum technology. NISQ processors feature tens to a few hundreds of noisy qubits with limited coherence times and gate operations with errors, so NISQ algorithms naturally require employing circuits of short lengths via quantum compilation. Here, we develop a reinforcement learning (RL)-based quantum compiler for a superconducting processor and demonstrate its capability of discovering novel and hardware-amenable circuits with short lengths. We show that for the three-qubit quantum Fourier transformation, a compiled circuit using only seven CZ gates with unity circuit fidelity can be achieved. The compiler is also able to find optimal circuits under device topological constraints, with lengths considerably shorter than those by the conventional method. Our study exemplifies the codesign of the software with hardware for efficient quantum compilation, offering valuable insights for the advancement of RL-based compilers.
Published: 2024

18. Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design

Author: Gao, Ming, Chen, Hang, Du, Jun, Xu, Xin, Guo, Hongxiao, Bu, Hui, Yang, Jianxing, Li, Ming, and Lee, Chin-Hui
Subjects: Computer Science - Computation and Language
Abstract: Smart home technology has gained widespread adoption, facilitating effortless control of devices through voice commands. However, individuals with dysarthria, a motor speech disorder, face challenges due to the variability of their speech. This paper addresses the wake-up word spotting (WWS) task for dysarthric individuals, aiming to integrate them into real-world applications. To support this, we release the open-source Mandarin Dysarthria Speech Corpus (MDSC), a dataset designed for dysarthric individuals in home environments. MDSC encompasses information on age, gender, disease types, and intelligibility evaluations. Furthermore, we perform comprehensive experimental analysis on MDSC, highlighting the challenges encountered. We also develop a customized dysarthria WWS system that showcases robustness in handling intelligibility and achieving exceptional performance. MDSC will be released on https://www.aishelltech.com/AISHELL_6B.
Comment: To be published in Interspeech 2024
Published: 2024

19. SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding

Author: Ma, Jiefeng, Wang, Yan, Liu, Chenyu, Du, Jun, Hu, Yu, Zhang, Zhenrong, Hu, Pengfei, Wang, Qing, and Zhang, Jianshu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Accurately identifying and organizing textual content is crucial for the automation of document processing in the field of form understanding. Existing datasets, such as FUNSD and XFUND, support entity classification and relationship prediction tasks but are typically limited to local and entity-level annotations. This limitation overlooks the hierarchically structured representation of documents, constraining comprehensive understanding of complex forms. To address this issue, we present SRFUND, a hierarchically structured multi-task form understanding benchmark. SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets, encompassing five tasks: (1) word to text-line merging, (2) text-line to entity merging, (3) entity category classification, (4) item table localization, and (5) entity-based full-document hierarchical structure recovery. We meticulously supplemented the original dataset with missing annotations at various levels of granularity and added detailed annotations for multi-item table regions within the forms. Additionally, we introduce global hierarchical structure dependencies for entity relation prediction tasks, surpassing traditional local key-value associations. The SRFUND dataset covers eight languages (English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese), making it a powerful tool for cross-lingual form understanding. Extensive experimental results demonstrate that the SRFUND dataset presents new challenges and significant opportunities in handling diverse layouts and global hierarchical structures of forms, thus providing deep insights into the field of form understanding. The original dataset and implementations of baseline methods are available at https://sprateam-ustc.github.io/SRFUND
Comment: NeurIPS 2024 Track on Datasets and Benchmarks, under review
Published: 2024

20. AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection

Author: Gong, Rong, Xue, Hongfei, Wang, Lezhi, Xu, Xin, Li, Qisheng, Xie, Lei, Bu, Hui, Wu, Shaomei, Zhou, Jiaming, Qin, Yong, Zhang, Binbin, Du, Jun, Bin, Jia, and Li, Ming
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks like automatic speech recognition (ASR) for fluent speech. However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the largest dataset in its category. Encompassing conversational and voice command reading speech, AS-70 includes verbatim manual transcription, rendering it suitable for various speech-related tasks. Furthermore, baseline systems are established, and experimental results are presented for ASR and stuttering event detection (SED) tasks. By incorporating this dataset into the model fine-tuning, significant improvements in the state-of-the-art ASR models, e.g., Whisper and Hubert, are observed, enhancing their inclusivity in addressing stuttered speech.
Comment: Accepted by Interspeech 2024
Published: 2024

21. A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition

Author: Guo, Zilu, Wang, Qing, Du, Jun, Pan, Jia, Liu, Qing-Feng, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a variance-preserving interpolation framework to improve diffusion models for single-channel speech enhancement (SE) and automatic speech recognition (ASR). This new variance-preserving interpolation diffusion model (VPIDM) approach requires only 25 iterative steps and obviates the need for a corrector, an essential element in the existing variance-exploding interpolation diffusion model (VEIDM). Two notable distinctions between VPIDM and VEIDM are the scaling function of the mean of state variables and the constraint imposed on the variance relative to the mean's scale. We conduct a systematic exploration of the theoretical mechanism underlying VPIDM and develop insights regarding VPIDM's applications in SE and ASR using VPIDM as a frontend. Our proposed approach, evaluated on two distinct data sets, demonstrates VPIDM's superior performances over conventional discriminative SE algorithms. Furthermore, we assess the performance of the proposed model under varying signal-to-noise ratio (SNR) levels. The investigation reveals VPIDM's improved robustness in target noise elimination when compared to VEIDM. Furthermore, utilizing the mid-outputs of both VPIDM and VEIDM results in enhanced ASR accuracies, thereby highlighting the practical efficacy of our proposed approach.
Published: 2024

22. All-voltage control of Giant Magnetoresistance

Author: Wei, Lujun, Zhang, Yiyang, Huang, Fei, Yang, Jiajv, Peng, Jincheng, Li, Yanghui, Lu, Yu, Chen, Jiarui, Liu, Tianyu, Pu, Yong, and Du, Jun
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics, Condensed Matter - Materials Science
Abstract: The aim of voltage control of magnetism is to reduce the power consumption of spintronic devices. For a spin valve, the magnetization directions of two ferromagnetic layers determine the giant magnetoresistance magnitude. However, achieving all-voltage manipulation of the magnetization directions between parallel and antiparallel states is a significant challenge. Here, we demonstrate that by utilizing two exchange-biased Co/IrMn bilayers with opposite pinning directions and with ferromagnetic coupling through the Ruderman-Kittel-Kasuya-Yosida interaction between two Co layers, the magnetization directions of the two ferromagnetic layers of a spin valve can be switched between parallel and antiparallel states through all-voltage-induced strain control. The all-voltage controlled giant magnetoresistance is repeatable and nonvolatile. The rotation of magnetizations in the two Co layers under voltages, from antiparallel to parallel states, occurs in opposite directions as revealed through simulations utilizing the Landau-Lifshitz-Gilbert equation. This work provides a valuable reference for the development of low-power all-voltage-controlled spintronic devices.
Published: 2024

23. QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Author: Li, Chang, Wang, Ruoyu, Liu, Lijuan, Du, Jun, Sun, Yixuan, Guo, Zilu, Zhang, Zhenrong, and Jiang, Yuan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering an innovative approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, including both high-fidelity audio waveforms and detailed text descriptions, which often constitute only a small portion of available datasets. In open-source datasets, issues such as low-quality music waveforms, mislabeling, weak labeling, and unlabeled data significantly hinder the development of music generation models. To address these challenges, we propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy, enabling generative models to discern the quality of input music waveforms during training. Leveraging the unique properties of musical signals, we first adapted and implemented a masked diffusion transformer (MDT) model for the TTM task, demonstrating its distinct capacity for quality control and enhanced musicality. Additionally, we address the issue of low-quality captions in TTM with a caption refinement data processing approach. Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer Dataset. Our demo page can be accessed at https://qa-mdt.github.io/.
Published: 2024

24. SEMv3: A Fast and Robust Approach to Table Separation Line Detection

Author: Qin, Chunxia, Zhang, Zhenrong, Hu, Pengfei, Liu, Chenyu, Ma, Jiefeng, and Du, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Table structure recognition (TSR) aims to parse the inherent structure of a table from its input image. The "split-and-merge" paradigm is a pivotal approach to parse table structure, where the table separation line detection is crucial. However, challenges such as wireless and deformed tables make it demanding. In this paper, we adhere to the "split-and-merge" paradigm and propose SEMv3 (SEM: Split, Embed and Merge), a method that is both fast and robust for detecting table separation lines. During the split stage, we introduce a Keypoint Offset Regression (KOR) module, which effectively detects table separation lines by directly regressing the offset of each line relative to its keypoint proposals. Moreover, in the merge stage, we define a series of merge actions to efficiently describe the table structure based on table grids. Extensive ablation studies demonstrate that our proposed KOR module can detect table separation lines quickly and accurately. Furthermore, on public datasets (e.g., WTW, ICDAR-2019 cTDaR Historical and iFLYTAB), SEMv3 achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/Chunchunwumu/SEMv3.
Comment: 9 pages, 6 figures, 5 tables. Accepted by IJCAI 2024 main track
Published: 2024

25. Multitask frame-level learning for few-shot sound event detection

Author: Zou, Liang, Yan, Genwei, Wang, Ruoyu, Du, Jun, Lei, Meng, Gao, Tian, and Fang, Xin
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper focuses on few-shot Sound Event Detection (SED), which aims to automatically recognize and classify sound events with limited samples. However, prevailing methods in few-shot SED predominantly rely on segment-level predictions, which often fail to provide detailed, fine-grained predictions, particularly for events of brief duration. Although frame-level prediction strategies have been proposed to overcome these limitations, these strategies commonly face difficulties with prediction truncation caused by background noise. To alleviate this issue, we introduce an innovative multitask frame-level SED framework. In addition, we introduce TimeFilterAug, a linear timing mask for data augmentation, to increase the model's robustness and adaptability to diverse acoustic environments. The proposed method achieves an F-score of 63.8%, securing the 1st rank in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events Challenge 2023.
Comment: 6 pages, 4 figures, conference
Published: 2024

26. A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Author: Dai, Yusheng, Chen, Hang, Du, Jun, Wang, Ruoyu, Chen, Shihao, Ma, Jiefeng, Wang, Haotian, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR
Comment: The paper is accepted by CVPR 2024
Published: 2024

27. Colloidal quantum dots enable tunable liquid-state lasers

Author: Hahm, Donghyo, Pinchetti, Valerio, Livache, Clément, Ahn, Namyoung, Noh, Jungchul, Li, Xueyang, Du, Jun, Wu, Kaifeng, and Klimov, Victor I.
Published: 2024

28. Adenosine mediates the amelioration of social novelty deficits during rhythmic light treatment of 16p11.2 deletion female mice

Author: Ju, Jun, Li, Xuanyi, Pan, Yifan, Du, Jun, Yang, Xinyi, Men, Siqi, Liu, Bo, Zhang, Zhenyu, Zhong, Haolin, Mai, Jinyuan, Wang, Yizheng, and Hou, Sheng-Tao
Published: 2024

29. Blue lasers using low-toxicity colloidal quantum dots

Author: Lin, Xuyang, Yang, Yang, Li, Xueyang, Lv, Yongshun, Wang, Zhaolong, Du, Jun, Luo, Xiaohan, Zhou, Dongjian, Xiao, Chunlei, and Wu, Kaifeng
Published: 2024

30. Guidelines for the diagnosis and treatment of neurally mediated syncope in children and adolescents (revised 2024)

Author: Wang, Cheng, Liao, Ying, Wang, Shuo, Tian, Hong, Huang, Min, Dong, Xiang-Yu, Shi, Lin, Li, Ya-Qi, Sun, Jing-Hui, Du, Jun-Bao, and Jin, Hong-Fang
Published: 2024

31. Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement

Author: Thuon, Nimol, Du, Jun, Zhang, Zhenrong, Ma, Jiefeng, and Hu, Pengfei
Published: 2024

32. Age and mean platelet volume-based nomogram for predicting the therapeutic efficacy of metoprolol in Chinese pediatric patients with vasovagal syncope

Author: Du, Xiao-Juan, Huang, Ya-Qian, Li, Xue-Ying, Liao, Ying, Jin, Hong-Fang, and Du, Jun-Bao
Published: 2024

33. Assessing the quality of chlorophyll-a concentration products under multiple spatial and temporal scales

Author: Wang, Zheng, Zeng, Qun, Qiu, Shike, Wang, Chao, Sun, Tingting, and Du, Jun
Published: 2024

34. Envisioning the Future Role of 3D Wireless Networks in Preventing and Managing Disasters and Emergency Situations

Author: Alhammadi, Ahmed, Abraham, Anuj, Fakhreddine, Aymen, Tian, Yu, Du, Jun, and Bader, Faouzi
Subjects: Computer Science - Networking and Internet Architecture
Abstract: In an era marked by unprecedented climatic upheavals and evolving urban landscapes, the role of advanced communication networks in disaster prevention and management is becoming increasingly critical. This paper explores the transformative potential of 3D wireless networks, an innovative amalgamation of terrestrial, aerial, and satellite technologies, in enhancing disaster response mechanisms. We delve into a myriad of use cases, ranging from large facility evacuations to wildfire management, underscoring the versatility of these networks in ensuring timely communication, real-time situational awareness, and efficient resource allocation during crises. We also present an overview of cutting-edge prototypes, highlighting the practical feasibility and operational efficacy of 3D wireless networks in real-world scenarios. Simultaneously, we acknowledge the challenges posed by aspects such as cybersecurity, cross-border coordination, and physical layer technological hurdles, and propose future directions for research and development in this domain.
Published: 2024

35. On Inhomogeneous Infinite Products of Stochastic Matrices and Applications

Author: Xia, Zhaoyue, Du, Jun, Jiang, Chunxiao, Poor, H. Vincent, Han, Zhu, and Ren, Yong
Subjects: Mathematics - Optimization and Control, Electrical Engineering and Systems Science - Systems and Control
Abstract: With the growth of magnitude of multi-agent networks, distributed optimization holds considerable significance within complex systems. Convergence, a pivotal goal in this domain, is contingent upon the analysis of infinite products of stochastic matrices (IPSMs). In this work, convergence properties of inhomogeneous IPSMs are investigated. The convergence rate of inhomogeneous IPSMs towards an absolute probability sequence $\pi$ is derived. We also show that the convergence rate is nearly exponential, which coincides with existing results on ergodic chains. The methodology employed relies on delineating the interrelations among Sarymsakov matrices, scrambling matrices, and positive-column matrices. Based on the theoretical results on inhomogeneous IPSMs, we propose a decentralized projected subgradient method for time-varying multi-agent systems with graph-related stretches in (sub)gradient descent directions. The convergence of the proposed method is established for convex objective functions, and extended to non-convex objectives that satisfy Polyak-Lojasiewicz conditions. To corroborate the theoretical findings, we conduct numerical simulations, aligning the outcomes with the established theoretical framework.
Published: 2024

36. Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition

Author: Cheng, Hanbo, Liu, Chenyu, Hu, Pengfei, Zhang, Zhenrong, Ma, Jiefeng, and Du, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR. Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models. However, existing methods fail to effectively utilize bidirectional context information during the inference stage. Furthermore, current bidirectional training methods are primarily designed for string decoders and cannot adequately generalize to tree decoders, which offer superior generalization capabilities and structural analysis capacity. In order to overcome these limitations, we propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure. Our method extends the bidirectional training strategy to the tree decoder, allowing for more effective training by leveraging bidirectional information. Additionally, we analyze the impact of the visual and linguistic perception of the HMER model separately and introduce the Shared Language Modeling (SLM) mechanism. Through the SLM, we enhance the model's robustness and generalization when dealing with visual ambiguity, particularly in scenarios with abundant training data. Our approach has been validated through extensive experiments, demonstrating its ability to achieve new state-of-the-art results on the CROHME 2014, 2016, and 2019 datasets, as well as the HME100K dataset. The code used in our experiments will be publicly available.
Published: 2023

37. NAMER: Non-autoregressive Modeling for Handwritten Mathematical Expression Recognition

Author: Liu, Chenyu, Pan, Jia, Hu, Jinshui, Yin, Baocai, Yin, Bing, Chen, Mingjun, Liu, Cong, Du, Jun, and Liu, Qingfeng; Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members); Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül (editors)
Published: 2025

38. Experimental and Constitutive Modeling Investigations of the Mechanical Behaviors of a Gravelly Soil Material Under Large-Size Triaxial Cyclic Tests

Author: Zhang, Jiu-Chang, Du, Jun, Li, Dong, Qiu, Cheng-Jiang, Li, Biao, and Wang, Ru-Bin
Published: 2024

39. Effect of Sr Modification on the Microstructures, Mechanical Properties, and Thermal Conductivity of Hypoeutectic Al-13.6Cu-6Si Alloys

Author: Li, Chengbo, Hou, Huibing, Liu, Leilei, Huang, Chengyi, Ren, Yuelu, Du, Jun, and Yin, Cailiu
Published: 2024

40. Coupling of BiOCl Ultrathin Nanosheets with Carbon Quantum Dots for Enhanced Photocatalytic Performance

Author: Song, Pin, Fang, Xiaoyu, Jiang, Wei, Cao, Yuyang, Liu, Daobin, Wei, Shiqiang, Du, Jun, Sun, Lang, Zhao, Lei, Liu, Song, Zhou, Yuzhu, Di, Jun, Lv, Chade, Tang, Bijun, Yang, Jiefu, Kong, Tingting, and Xiong, Yujie
Published: 2024

41. CDSD: Chinese Dysarthria Speech Database

Author: Sun, Mengyi, Gao, Ming, Kang, Xinchen, Wang, Shiru, Du, Jun, Yao, Dengfeng, and Wang, Su-Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. This database comprises speech data from 24 participants with dysarthria. Each participant recorded one hour of speech, and one of them recorded an additional 10 hours, resulting in 34 hours of speech material. To accommodate participants with varying cognitive levels, our text pool primarily consists of content from the AISHELL-1 dataset and speeches by primary and secondary school students. Participants read these texts aloud and record their speech using a mobile device or the ZOOM F8n multi-track field recorder. In this paper, we elucidate the data collection and annotation processes and present an approach for establishing a baseline for dysarthric speech recognition. Furthermore, we conducted a speaker-dependent dysarthric speech recognition experiment using an additional 10 hours of speech data from one of our participants. Our research findings indicate that, through extensive data-driven model training, fine-tuning limited quantities of specific individual data yields commendable results in speaker-dependent dysarthric speech recognition. However, we observe significant variations in recognition results among different dysarthric speakers. These insights provide valuable reference points for speaker-dependent dysarthric speech recognition.
Comment: 9 pages, 3 figures
Published: 2023

42. Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning

Author: Guo, Zilu, Du, Jun, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to indicate the denoising process. The starting state is noisy speech and the ending state is clean speech. The noise component in the state variable decreases with the change of the state index until the noise component is 0. During training, a UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. In testing, we introduce a controlling factor as an embedding, ranging from zero to one, to the neural network, allowing us to control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.
Comment: We found that the results were obtained with incorrect experimental settings. New experiments are needed.
Published: 2023

43. Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture

Author: Yang, Gaobin, He, Maokui, Niu, Shutong, Wang, Ruoyu, Yue, Yanyan, Qian, Shuangqing, Wu, Shilong, Du, Jun, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both efficiency and performance. Next, we further decrease the memory occupation of decoding by incorporating input features fusion and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which signifies a relative improvement of 49% over the official baseline system, and is the key technique for us to achieve the best performance for the main track of CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in MA-MSE module to better retrieve a cleaner and more discriminative multi-speaker embedding, enabling the current model to outperform the system we used in the CHiME-7 DASR Challenge. Our code will be available at https://github.com/liyunlongaaa/NSD-MS2S.
Comment: Accepted by ICASSP 2024
Published: 2023

44. The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Author: Wu, Shilong, Wang, Chenxi, Chen, Hang, Dai, Yusheng, Zhang, Chenyue, Wang, Ruoyu, Lan, Hongbo, Du, Jun, Lee, Chin-Hui, Chen, Jingdong, Watanabe, Shinji, Siniscalchi, Sabato Marco, Scharenborg, Odette, Wang, Zhong-Qiu, Pan, Jia, and Gao, Jianqing
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
Comment: 5 pages, 4 figures
Published: 2023

45. Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

Author: Wang, Haotian, Xi, Yuxuan, Chen, Hang, Du, Jun, Song, Yan, Wang, Qing, Zhou, Hengshun, Wang, Chenxi, Ma, Jiefeng, Hu, Pengfei, Jiang, Ya, Cheng, Shi, Zhang, Jie, and Weng, Yuzhe
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Computer Science - Sound
Abstract: In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models are used as robust acoustic and visual representations of raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. Then, we introduce a joint decoding structure for emotion classification and valence regression in the decoding stage. A multi-task loss based on uncertainty is also designed to optimize the whole process. Finally, by combining three different structures on the posterior probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of the multimodal emotion recognition challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
Comment: 5 pages, 4 figures
Published: 2023

46. The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

Author: Wang, Ruoyu, He, Maokui, Du, Jun, Zhou, Hengshun, Niu, Shutong, Chen, Hang, Yue, Yanyan, Yang, Gaobin, Wu, Shilong, Sun, Lei, Tu, Yanhui, Tang, Haitao, Qian, Shuangqing, Gao, Tian, Wang, Mengzhi, Wan, Genshun, Pan, Jia, Gao, Jianqing, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This technical report details our submission system to the CHiME-7 DASR Challenge, which focuses on speaker diarization and speech recognition under complex multi-speaker scenarios. It also evaluates the efficiency of systems in handling diverse array devices. To address these issues, we implemented an end-to-end speaker diarization system and introduced a rectification strategy based on multi-channel spatial information. This approach significantly diminished the word error rates (WER). In terms of recognition, we utilized publicly available pre-trained models as the foundational models to train our end-to-end speech recognition models. Our system attained a Macro-averaged diarization-attributed WER (DA-WER) of 21.01% on the CHiME-7 evaluation set, which signifies a relative improvement of 62.04% over the official baseline system.
Comment: Accepted by 2023 CHiME Workshop, Oral
Published: 2023

47. Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Author: Dai, Yusheng, Chen, Hang, Du, Jun, Ding, Xiaofei, Ding, Ning, Jiang, Feijun, and Lee, Chin-Hui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Unmatching convergence rates and specialized input representations between audio and visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. This enables accurate alignment of video and audio streams during visual model pre-training and cross-modal fusion. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performances than state-of-the-art systems with more complex front-ends and back-ends.
Comment: 6 pages, 2 figures, published in ICME2023
Published: 2023

48. Count, Decode and Fetch: A New Approach to Handwritten Chinese Character Error Correction

Author: Hu, Pengfei, Ma, Jiefeng, Zhang, Zhenrong, Du, Jun, and Zhang, Jianshu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, handwritten Chinese character error correction has been greatly improved by employing encoder-decoder methods to decompose a Chinese character into an ideographic description sequence (IDS). However, existing methods implicitly capture and encode linguistic information inherent in IDS sequences, leading to a tendency to generate IDS sequences that match seen characters. This poses a challenge when dealing with an unseen misspelled character, as the decoder may generate an IDS sequence that matches a seen character instead. Therefore, we introduce Count, Decode and Fetch (CDF), a novel approach that exhibits better generalization towards unseen misspelled characters. CDF is mainly composed of three parts: the counter, the decoder, and the fetcher. In the first stage, the counter predicts the number of each radical class without the symbol-level position annotations. In the second stage, the decoder employs the counting information and generates the IDS sequence step by step. Moreover, by updating the counting information at each time step, the decoder becomes aware of the existence of each radical. With the decomposed IDS sequence, we can determine whether the given character is misspelled. If it is misspelled, the fetcher under the transductive transfer learning strategy predicts the ideal character that the user originally intended to write. We integrate our method into existing encoder-decoder models and significantly enhance their performance.
Published: 2023

49. Semi-supervised multi-channel speaker diarization with cross-channel attention

Author: Wu, Shilong, Du, Jun, He, Maokui, Niu, Shutong, Chen, Hang, Tang, Haitao, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most neural speaker diarization systems rely on sufficient manual training data labels, which are hard to collect under real-world scenarios. This paper proposes a semi-supervised speaker diarization system to utilize large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into the Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to learn channel contextual information of speaker embeddings better. Experimental results on the CHiME-7 Mixer6 dataset, which contains only partial speaker labels for the training set, show that our system achieved 57.01% relative DER reduction compared to the clustering-based model on the development set. We further conducted experiments on the CHiME-6 dataset to simulate the scenario of missing partial training set labels. When using 80% and 50% labeled training data, our system performs comparably to the results obtained using 100% labeled data for training.
Comment: 8 pages, 3 figures
Published: 2023

50. Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement

Author: Guo, Zilu, Du, Jun, Lee, Chin-Hui, Gao, Yu, and Zhang, Wenbin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: The goal of this study is to implement diffusion models for speech enhancement (SE). The first step is to emphasize the theoretical foundation of variance-preserving (VP)-based interpolation diffusion under continuous conditions. Subsequently, we present a more concise framework that encapsulates both the VP- and variance-exploding (VE)-based interpolation diffusion methods. We demonstrate that these two methods are special cases of the proposed framework. Additionally, we provide a practical example of VP-based interpolation diffusion for the SE task. To improve performance and ease model training, we analyze the common difficulties encountered in diffusion models and suggest amenable hyper-parameters. Finally, we evaluate our model against several methods using a public benchmark to showcase the effectiveness of our approach.
Published: 2023
