Author: "Yuexian Zou" / Language: undetermined - Searchworks@Jio Institute Digital Library Search Results

1. Diffsound: Discrete Diffusion Model for Text-to-Sound Generation

Author: Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computational Mathematics, Artificial Intelligence (cs.AI), Acoustics and Ultrasonics, Audio and Speech Processing (eess.AS), Computer Science - Artificial Intelligence, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform. We found that the decoder significantly influences the generation performance. Thus, we focus on designing a good decoder in this study. We begin with the traditional autoregressive decoder, which has been proved as a state-of-the-art method in previous sound generation works. However, the AR decoder always predicts the mel-spectrogram tokens one by one in order, which introduces the unidirectional bias and accumulation of errors problems. Moreover, with the AR decoder, the sound generation time increases linearly with the sound duration. To overcome the shortcomings introduced by AR decoders, we propose a non-autoregressive decoder based on the discrete diffusion model, named Diffsound. Specifically, the Diffsound predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in the next step, so the best-predicted results can be obtained after several steps. Our experiments show that our proposed Diffsound not only produces better text-to-sound generation results when compared with the AR decoder but also has a faster generation speed, e.g., MOS: 3.56 vs. 2.786, and the generation speed is five times faster than the AR decoder. Comment: Accepted by TASLP 2022
Published: 2023

2. Improving Retrieval-Based Dialogue System Via Syntax-Informed Attention

Author: Tengtao Song, Nuo Chen, Ji Jiang, Zhihong Zhu, and Yuexian Zou
Published: 2023

3. M3ST: Mix at Three Levels for Speech Translation

Author: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, and Yuexian Zou
Published: 2023

4. RR-Net: Relation Reasoning for End-to-End Human-Object Interaction Detection

Author: Can Zhang, Dongming Yang, Yuexian Zou, Jie Chen, and Meng Cao
Subjects: End-to-end principle, Relation (database), business.industry, Computer science, Media Technology, Net (polyhedron), Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business, Object (computer science)
Published: 2022

5. Adaptive Curriculum Learning for Video Captioning

Author: Shanhao Li, Bang Yang, and Yuexian Zou
Subjects: General Computer Science, General Engineering, General Materials Science
Published: 2022

6. Improving Weakly Supervised Sound Event Detection with Causal Intervention

Author: Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, and Yuexian Zou
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision. To tackle this issue, we first establish a structural causal model (SCM) to reveal that the context is the main cause of co-occurrence confounders that mislead the model to learn spurious correlations between frames and clip-level labels. Based on the causal analysis, we propose a causal intervention (CI) method for WSSED to remove the negative impact of co-occurrence confounders by iteratively accumulating every possible context of each class and then re-projecting the contexts to the frame-level features for making the event boundary clearer. Experiments show that our method effectively improves the performance on multiple datasets and can generalize to various baseline models. Comment: Accepted by ICASSP 2023
Published: 2023

7. FeatureCut: An Adaptive Data Augmentation for Automated Audio Captioning

Author: Zhongjie Ye, Yuqing Wang, Helin Wang, Dongchao Yang, and Yuexian Zou
Published: 2022

8. 3CMLF: Three-Stage Curriculum-Based Mutual Learning Framework for Audio-Text Retrieval

Author: Yi-Wen Chao, Dongchao Yang, Rongzhi Gu, and Yuexian Zou
Published: 2022

9. Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio Classification

Author: Yifei Xin, Dongchao Yang, and Yuexian Zou
Published: 2022

10. Deep Motion Prior for Weakly-Supervised Temporal Action Localization

Author: Meng Cao, Can Zhang, Long Chen, Mike Zheng Shou, and Yuexian Zou
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer Graphics and Computer-Aided Design, Software
Abstract: Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating to the video-level prediction. However, we argue that existing methods have overlooked two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of prevailing cross-entropy training loss. In this paper, we analyze that the motion cues behind the optical flow features are complementary and informative. Inspired by this, we propose to build a context-dependent motion prior, termed motionness. Specifically, a motion graph is introduced to model motionness based on the local motion carrier (e.g., optical flow). In addition, to highlight more informative video snippets, a motion-guided loss is proposed to modulate the network training conditioned on motionness scores. Extensive ablation studies confirm that motionness efficaciously models action-of-interest, and the motion-guided loss leads to more accurate results. Besides, our motion-guided loss is a plug-and-play loss function and is applicable with existing WSTAL methods. Without loss of generality, based on the standard MIL pipeline, our method achieves new state-of-the-art performance on three challenging benchmarks, including THUMOS'14, ActivityNet v1.2 and v1.3. Comment: Accepted by IEEE Transactions on Image Processing (TIP)
Published: 2022

11. DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Author: Xuancheng Ren, Fenglin Liu, Xian Wu, Shen Ge, Yuexian Zou, Xu Sun, and Wei Fan
Subjects: FOS: Computer and information sciences, Closed captioning, Computer Science - Computation and Language, Modalities, General Computer Science, Computer science, Process (engineering), Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Space (commercial competition), 03 medical and health sciences, Task (computing), 0302 clinical medicine, Human–computer interaction, 030221 ophthalmology & optometry, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Language model, Set (psychology), Computation and Language (cs.CL), Natural language
Abstract: Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, thus learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and achieve improved results in many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which is a novel framework that applies separated attention spaces for vision and language, and the representations of multi-modalities can thus be disentangled explicitly. To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large amount of image-sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts. Comment: Published in ACM TKDD 2022 (ACM Transactions on Knowledge Discovery from Data)
Published: 2021

12. Audio-Oriented Multimodal Machine Comprehension via Dynamic Inter- and Intra-modality Attention

Author: Zhiqi Huang, Fenglin Liu, Xian Wu, Shen Ge, Helin Wang, Wei Fan, and Yuexian Zou
Subjects: General Medicine
Abstract: While Machine Comprehension (MC) has attracted extensive research interests in recent years, existing approaches mainly belong to the category of Machine Reading Comprehension task which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, there are a lot of MC tasks that accept audio input in addition to the textual input, e.g. English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, and its goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks including: Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models. Experimental results and analysis prove the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in terms of accuracy; Second, under the unimodal scenarios, the MKD module allows our multimodal MC model to significantly outperform the unimodal models by up to 18.87%, which are trained and tested with only audio or textual data.
Published: 2021

13. Utilizing Text-based Augmentation to Enhance Video Captioning

Author: Shanhao Li, Bang Yang, and Yuexian Zou
Published: 2022

14. Leveraging Bilinear Attention to Improve Spoken Language Understanding

Author: Dongsheng Chen, Zhiqi Huang, and Yuexian Zou
Published: 2022

15. Federated Learning for Vision-and-Language Grounding Problems

Author: Fenglin Liu, Yuexian Zou, Xian Wu, Wei Fan, and Shen Ge
Subjects: Closed captioning, Information retrieval, Ground, Computer science, 02 engineering and technology, General Medicine, 010501 environmental sciences, 01 natural sciences, Federated learning, 0202 electrical engineering, electronic engineering, information engineering, Question answering, 020201 artificial intelligence & image processing, Transfer of learning, 0105 earth and related environmental sciences
Abstract: Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both academia and industry. However, given the similarity of these tasks, the efforts to obtain better results by combining the merits of their algorithms are not well studied. Inspired by the recent success of federated learning, we propose a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems, and are thus much more powerful than the original representations alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). The aimNet is validated on three federated learning settings, which include horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate effective and universal improvements on all metrics over the baselines. In image captioning, we are able to get 14% and 13% relative gain on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we could also boost the performance of strong baselines by up to 3%.
Published: 2020

16. Modeling Label Dependencies for Audio Tagging With Graph Convolutional Network

Author: Helin Wang, Yuexian Zou, Dading Chong, and Wenwu Wang
Subjects: business.industry, Computer science, Applied Mathematics, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Graph, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Symmetric matrix, Graph (abstract data type), Adjacency matrix, Artificial intelligence, Electrical and Electronic Engineering, business
Abstract: As a multi-label classification task, audio tagging aims to predict the presence or absence of certain sound events in an audio recording. Existing works in audio tagging do not explicitly consider the probabilities of the co-occurrences between sound events, which is termed as the label dependencies in this study. To address this issue, we propose to model the label dependencies via a graph-based method, where each node of the graph represents a label. An adjacency matrix is constructed by mining the statistical relations between labels to represent the graph structure information, and a graph convolutional network (GCN) is employed to learn node representations by propagating information between neighboring nodes based on the adjacency matrix, which implicitly models the label dependencies. The generated node representations are then applied to the acoustic representations for classification. Experiments on Audioset show that our method achieves a state-of-the-art mean average precision (mAP) of 0.434.
Published: 2020
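
Note (illustrative sketch, not taken from the paper): the abstract above describes building an adjacency matrix from label co-occurrence statistics and propagating information between neighboring label nodes. The abstract does not give the exact construction, so the snippet below assumes a conditional-probability estimate A[i][j] = P(label j | label i) and a single unweighted averaging step; the function names `cooccurrence_adjacency` and `propagate` are hypothetical, and a real GCN layer would also apply a learned linear transform and a nonlinearity.

```python
from typing import List


def cooccurrence_adjacency(labels: List[List[int]], num_classes: int) -> List[List[float]]:
    """Estimate A[i][j] = P(label j present | label i present) from multi-hot clip labels."""
    count = [0] * num_classes                                # clips containing label i
    pair = [[0] * num_classes for _ in range(num_classes)]   # clips containing both i and j
    for clip in labels:
        present = [k for k in range(num_classes) if clip[k]]
        for i in present:
            count[i] += 1
            for j in present:
                if i != j:
                    pair[i][j] += 1
    return [
        [pair[i][j] / count[i] if count[i] else 0.0 for j in range(num_classes)]
        for i in range(num_classes)
    ]


def propagate(adj: List[List[float]], node_feats: List[List[float]]) -> List[List[float]]:
    """One propagation step: each node takes a weighted average of itself and its neighbours."""
    n = len(adj)
    dim = len(node_feats[0])
    out = []
    for i in range(n):
        # Add a self-loop of weight 1 so a node keeps part of its own representation.
        weights = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(weights)
        out.append([
            sum(weights[j] * node_feats[j][d] for j in range(n)) / total
            for d in range(dim)
        ])
    return out


# Toy data (assumption): four clips, three labels, multi-hot encoded.
clips = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
A = cooccurrence_adjacency(clips, 3)
identity_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = propagate(A, identity_feats)
```

The resulting matrix is asymmetric by design: in the toy data, label 2 never appears without label 1, while label 1 appears with label 2 only a third of the time.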

17. LocVTP: Video-Text Pre-training for Temporal Localization

Author: Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, and Yuexian Zou
Published: 2022

18. CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Author: Bang Yang, Tong Zhang, and Yuexian Zou
Published: 2022

19. Visual Relation-Aware Unsupervised Video Captioning

Author: Puzhao Ji, Meng Cao, and Yuexian Zou
Published: 2022

20. RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

Author: Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, and WenWu Wang
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Target sound detection (TSD) aims to detect the target sound from a mixture audio given the reference information. Previous methods use a conditional network to extract a sound-discriminative embedding from the reference audio, and then use it to detect the target sound from the mixture audio. However, the network performs much differently when using different reference audios (e.g. performs poorly for noisy and short-duration reference audios), and tends to make wrong decisions for transient events (i.e. shorter than 1 second). To overcome these problems, in this paper, we present a reference-aware and duration-robust network (RaDur) for TSD. More specifically, in order to make the network more aware of the reference information, we propose an embedding enhancement module to take into account the mixture audio while generating the embedding, and apply the attention pooling to enhance the features of target sound-related frames and weaken the features of noisy frames. In addition, a duration-robust focal loss is proposed to help model different-duration events. To evaluate our method, we build two TSD datasets based on UrbanSound and Audioset. Extensive experiments show the effectiveness of our methods. Comment: Submitted to Interspeech 2022
Published: 2022

21. Consensus-Guided Keyword Targeting for Video Captioning

Author: Puzhao Ji, Bang Yang, Tong Zhang, and Yuexian Zou
Published: 2022

22. Learning Human-Object Interaction via Interactive Semantic Reasoning

Author: Yuexian Zou, Ge Li, Dongming Yang, and Zhu Li
Subjects: Parsing, Artificial neural network, business.industry, Computer science, Feature extraction, Construct (python library), Object (computer science), Semantics, Machine learning, computer.software_genre, Computer Graphics and Computer-Aided Design, Visualization, Humans, Artificial intelligence, business, Feature learning, computer, Software, Algorithms
Abstract: Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects via inferring triplets of ⟨human, verb, object⟩. Recent HOI detection methods infer HOIs by directly extracting appearance features and spatial configuration from related visual targets of human and object, but neglect powerful interactive semantic reasoning between these targets. Meanwhile, existing spatial encodings of visual targets have been simply concatenated to appearance features, which is unable to dynamically promote the visual feature learning. To solve these problems, we first present a novel semantic-based Interactive Reasoning Block, in which interactive semantics implied among visual targets are efficiently exploited. Beyond inferring HOIs using discrete instance features, we then design a HOI Inferring Structure to parse pairwise interactive semantics among visual targets in scene-wide level and instance-wide level. Furthermore, we propose a Spatial Guidance Model based on the location of human body-parts and object, which serves as a geometric guidance to dynamically enhance the visual feature learning. Based on the above modules, we construct a framework named Interactive-Net for HOI detection, which is fully differentiable and end-to-end trainable. Extensive experiments show that our proposed framework outperforms existing HOI detection methods on both V-COCO and HICO-DET benchmarks and improves on the baseline by about 5.9% and 17.7% in relative terms, validating its efficacy in detecting HOIs.
Published: 2021

23. SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification

Author: Helin Wang, Yuexian Zou, and Wenwu Wang
Abstract: In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC). Different from other popular data augmentation methods such as SpecAugment and mixup that only work on the input space, SpecAugment++ is applied to both the input space and the hidden space of the deep neural networks to enhance the input and the intermediate feature representations. For an intermediate hidden state, the augmentation techniques consist of masking blocks of frequency channels and masking blocks of time frames, which improve generalization by enabling a model to attend not only to the most discriminative parts of the feature, but also the entire parts. Apart from using zeros for masking, we also examine two approaches for masking based on the use of other samples within the mini-batch, which helps introduce noises to the networks to make them more discriminative for classification. The experimental results on the DCASE 2018 Task1 dataset and DCASE 2019 Task1 dataset show that our proposed method can obtain 3.6% and 4.7% accuracy gains over a strong baseline without augmentation (i.e., CP-ResNet) respectively, and outperforms other previous data augmentation methods.
Published: 2021
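
Note (illustrative sketch, not taken from the paper): the abstract above describes masking blocks of frequency channels and blocks of time frames with zeros. The snippet below shows that zero-masking variant only, on a plain 2-D frequency-by-time list; the function name `mask_blocks`, its parameters, and the choice of one block per axis are assumptions made for illustration. The same function could in principle be applied to an input spectrogram or to a hidden feature map.

```python
import random
from typing import List


def mask_blocks(
    feature: List[List[float]],
    freq_width: int,
    time_width: int,
    rng: random.Random,
) -> List[List[float]]:
    """Return a copy of `feature` (freq x time) with one frequency block and one time block zeroed."""
    n_freq, n_time = len(feature), len(feature[0])
    out = [row[:] for row in feature]  # copy so the caller's data is left untouched

    # Pick block start positions so each block fits inside the feature map.
    f0 = rng.randint(0, n_freq - freq_width)
    t0 = rng.randint(0, n_time - time_width)

    # Frequency masking: zero `freq_width` consecutive frequency channels (rows).
    for f in range(f0, f0 + freq_width):
        out[f] = [0.0] * n_time

    # Time masking: zero `time_width` consecutive time frames (columns) in every row.
    for row in out:
        row[t0:t0 + time_width] = [0.0] * time_width
    return out


# Toy 4 x 6 feature map of ones (assumption), masked with a seeded generator.
original = [[1.0] * 6 for _ in range(4)]
masked = mask_blocks(original, freq_width=1, time_width=2, rng=random.Random(0))
num_zeros = sum(v == 0.0 for row in masked for v in row)
```

With one masked row of 6 values and 2 masked columns in the remaining 3 rows, 12 of the 24 entries are zeroed regardless of where the blocks land.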

24. Semantic Transportation Prototypical Network for Few-Shot Intent Detection

Author: Peilin Zhou, Yuexian Zou, Chenyu You, and Weiyuan Xu
Subjects: Computer science, Shot (pellet), business.industry, Computer vision, Artificial intelligence, business
Published: 2021

25. Contextualized Attention-Based Knowledge Transfer for Spoken Conversational Question Answering

Author: Nuo Chen, Chenyu You, and Yuexian Zou
Subjects: FOS: Computer and information sciences, Text corpus, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer science, computer.software_genre, Computer Science - Sound, Machine Learning (cs.LG), Computer Science - Information Retrieval, Task (project management), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Question answering, Audio signal processing, Computer Science - Computation and Language, business.industry, Comprehension, Artificial Intelligence (cs.AI), Embedding, Artificial intelligence, business, Computation and Language (cs.CL), Knowledge transfer, computer, Information Retrieval (cs.IR), Natural language processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora. Different from traditional text question answering (QA) tasks, SCQA involves audio signal processing, passage comprehension, and contextual understanding. However, ASR systems introduce unexpected noisy signals to the transcriptions, which result in performance degradation on SCQA. To overcome the problem, we propose CADNet, a novel contextualized attention-based distillation approach, which applies both cross-attention and self-attention to obtain ASR-robust contextualized embedding representations of the passage and dialogue history for performance improvements. We also introduce the spoken conventional knowledge distillation framework to distill the ASR-robust knowledge from the estimated probabilities of the teacher model to the student. We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance in this task.
Published: 2021
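
Note (illustrative sketch, not taken from the paper): the abstract above mentions distilling knowledge from the teacher model's estimated probabilities to the student. The snippet below shows the generic temperature-softened distillation loss (a KL divergence between teacher and student distributions, scaled by the squared temperature), which is a common formulation and not necessarily the one used by CADNet; `softmax`, `distillation_loss`, and the default temperature are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy logits (assumption): the loss vanishes when the student matches the teacher.
teacher = [2.0, 0.5, -1.0]
same = distillation_loss(teacher, teacher)
different = distillation_loss([0.0, 0.0, 0.0], teacher)
```

In practice this term is usually combined with the ordinary supervised loss on the ground-truth answers; the weighting between the two is a tuning choice not specified here.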

26. Self-Supervised Dialogue Learning for Spoken Conversational Question Answering

Author: Nuo Chen, Chenyu You, and Yuexian Zou
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Coreference, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer science, business.industry, Resolution (logic), computer.software_genre, Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), Order (business), FOS: Electrical engineering, electronic engineering, information engineering, Question answering, Artificial intelligence, Language model, business, Computation and Language (cs.CL), computer, Natural language processing, Coherence (linguistics), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In spoken conversational question answering (SCQA), the answer to the corresponding question is generated by retrieving and then analyzing a fixed spoken document, including multi-part conversations. Most SCQA systems have considered only retrieving information from ordered utterances. However, the sequential order of dialogue is important to build a robust spoken conversational question answering system, and the changes of utterances order may severely result in low-quality and incoherent corpora. To this end, we introduce a self-supervised learning approach, including incoherence discrimination, insertion detection, and question prediction, to explicitly capture the coreference resolution and dialogue coherence among spoken documents. Specifically, we design a joint learning framework where the auxiliary self-supervised tasks can enable the pre-trained SCQA systems towards more coherent and meaningful spoken dialogue learning. We also utilize the proposed self-supervised learning tasks to capture intra-sentence coherence. Experimental results demonstrate that our proposed method provides more coherent, meaningful, and appropriate responses, yielding superior performance gains compared to the original pre-trained language models. Our method achieves state-of-the-art results on the Spoken-CoQA dataset. Comment: To appear in Interspeech 2021
Published: 2021

27. MRD-Net: Multi-Modal Residual Knowledge Distillation for Spoken Question Answering

Author: Yuexian Zou, Nuo Chen, and Chenyu You
Subjects: business.industry, Computer science, Net (mathematics), computer.software_genre, Residual, law.invention, Modal, law, Question answering, Artificial intelligence, business, Distillation, computer, Natural language processing
Abstract: Spoken question answering (SQA) has recently drawn considerable attention in the speech community. It requires systems to find correct answers from the given spoken passages simultaneously. The common SQA systems consist of the automatic speech recognition (ASR) module and text-based question answering module. However, previous methods suffer from severe performance degradation due to ASR errors. To alleviate this problem, this work proposes a novel multi-modal residual knowledge distillation method (MRD-Net), which further distills knowledge at the acoustic level from the audio-assistant (Audio-A). Specifically, we utilize the teacher (T) trained on manual transcriptions to guide the training of the student (S) on ASR transcriptions. We also show that introducing an Audio-A helps this procedure by learning residual errors between T and S. Moreover, we propose a simple yet effective attention mechanism to adaptively leverage audio-text features as the new deep attention knowledge to boost the network performance. Extensive experiments demonstrate that the proposed MRD-Net achieves superior results compared with state-of-the-art methods on three spoken question answering benchmark datasets.
Published: 2021

28. Sentiment Injected Iteratively Co-Interactive Network for Spoken Language Understanding

Author: Yuexian Zou, Fenglin Liu, Zhiqi Huang, and Peilin Zhou
Subjects: Computer science, business.industry, media_common.quotation_subject, Feature extraction, Multi-task learning, computer.software_genre, Task (project management), Benchmark (computing), Conversation, Artificial intelligence, business, Representation (mathematics), computer, Natural language processing, media_common, Spoken language
Abstract: Spoken Language Understanding (SLU) is an essential part of the spoken dialogue system, which typically consists of intent detection (ID) and slot filling (SF) tasks. During the conversation, most utterances of people contain rich sentimental information, which is helpful for performing the ID and SF tasks but has not been explored by existing works. In this paper, we argue that implicitly introducing sentimental features can promote SLU performance. Specifically, we present a Multitask Learning (MTL) framework to implicitly extract and utilize the aspect-based sentimental text features. Besides, we introduce an Iteratively Co-Interactive Network (ICN) for the SLU task to fully utilize the comprehensive text features. Experimental results show that with the external BERT representation, our framework achieves new state-of-the-art results on two benchmark datasets, i.e., SNIPS and ATIS.
Published: 2021

29. Knowledge Distillation for Improved Accuracy in Spoken Question Answering

Author: Nuo Chen, Yuexian Zou, and Chenyu You
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer science, computer.software_genre, Computer Science - Sound, Machine Learning (cs.LG), Computer Science - Information Retrieval, law.invention, Task (project management), Knowledge extraction, Audio and Speech Processing (eess.AS), law, FOS: Electrical engineering, electronic engineering, information engineering, Question answering, Distillation, Computer Science - Computation and Language, business.industry, Comprehension, Artificial Intelligence (cs.AI), Artificial intelligence, Language model, business, Computation and Language (cs.CL), computer, Information Retrieval (cs.IR), Natural language processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Spoken question answering (SQA) is a challenging task that requires the machine to fully understand the complex spoken documents. Automatic speech recognition (ASR) plays a significant role in the development of QA systems. However, the recent work shows that ASR systems generate highly noisy transcripts, which critically limit the capability of machine comprehension on the SQA task. To address the issue, we present a novel distillation framework. Specifically, we devise a training strategy to perform knowledge distillation (KD) from spoken documents and written counterparts. Our work makes a step towards distilling knowledge from the language model as a supervision signal to lead to better student accuracy by reducing the misalignment between automatic and manual transcriptions. Experiments demonstrate that our approach outperforms several state-of-the-art language models on the Spoken-SQuAD dataset. Comment: To appear in ICASSP 2021
Published: 2021

30. FWB-Net: Front White Balance Network for Color Shift Correction in Single Image Dehazing Via Atmospheric Light Estimation

Author: Cong Wang, Yuexian Zou, Yong Xu, and Yan Huang
Subjects: business.industry, Computer science, Scattering, Scalar (physics), Diffuse sky radiation, Color balance, Computer vision, Artificial intelligence, Atmospheric model, business, Net (mathematics), Bearing (navigation), Image (mathematics)
Abstract: In recent years, deep single-image dehazing models based on the Atmospheric Scattering Model (ASM) have achieved remarkable results, but their outputs suffer from color shift. Analyzing the ASM shows that the atmospheric light factor (ALF) is set as a scalar, which implies that the ALF is constant over the whole image. However, for images taken in the real world, the illumination is not uniformly distributed over the whole image, which causes model mismatch and possibly results in color shift in deep models using the ASM. Bearing this in mind, in this study, first, a new non-homogeneous atmospheric scattering model (NH-ASM) is proposed to improve the modeling of hazy images taken under complex illumination conditions. Second, a new U-Net-based front white balance module (FWB-Module) is designed to correct color shift via atmospheric light estimation before the dehazing result is generated. Third, a new FWB loss, which penalizes color shift, is developed for training the FWB-Module. Finally, based on NH-ASM and the front white balance technique, an end-to-end CNN-based color-shift-restraining dehazing network, termed FWB-Net, is developed. Experimental results demonstrate the effectiveness and superiority of the proposed FWB-Net for dehazing on both synthetic and real-world images.
Published: 2021

31. Adaptive Bi-Directional Attention: Exploring Multi-Granularity Representations for Machine Reading Comprehension

Author: Nuo Chen, Fenglin Liu, Peilin Zhou, Yuexian Zou, and Chenyu You
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial neural network, Computer Science - Artificial Intelligence, Computer science, business.industry, Machine learning, computer.software_genre, Artificial Intelligence (cs.AI), Knowledge extraction, Encoding (memory), Similarity (psychology), Benchmark (computing), Artificial intelligence, business, Representation (mathematics), Computation and Language (cs.CL), Encoder, computer, Transformer (machine learning model)
Abstract: Recently, attention-enhanced multi-layer encoders, such as the Transformer, have been extensively studied in Machine Reading Comprehension (MRC). To predict the answer, it is common practice to employ a predictor that draws information only from the final encoder layer, which generates coarse-grained representations of the source sequences, i.e., passage and question. Previous studies have shown that the representation of the source sequence shifts from fine-grained to coarse-grained as the number of encoding layers increases. It is generally believed that, as the number of layers in a deep neural network grows, the encoding process gathers more and more relevant information for each location, resulting in more coarse-grained representations, which increases the likelihood of similarity to other locations (i.e., homogeneity). Such a phenomenon can mislead the model into making wrong judgments and thus degrade performance. To this end, we propose a novel approach called Adaptive Bidirectional Attention, which adaptively feeds source representations from different levels to the predictor. Experimental results on the benchmark dataset SQuAD 2.0 demonstrate the effectiveness of our approach, and the results are better than the previous state-of-the-art model by 2.5% EM and 2.3% F1., Comment: five pages, four figures
Published: 2021

32. Contrastive Self-Supervised Learning for Text-Independent Speaker Verification

Author: Helin Wang, Yuexian Zou, and Haoran Zhang
Subjects: Signal processing, Channel (digital image), Exploit, Computer science, business.industry, computer.software_genre, Data modeling, Task (project management), Encoding (memory), Artificial intelligence, Representation (mathematics), business, computer, Utterance, Natural language processing
Abstract: Current speaker verification models rely on supervised training with massive annotated data. However, collecting labeled utterances from multiple speakers is expensive and raises privacy issues. To open up an opportunity for utilizing massive unlabeled utterance data, our work exploits a contrastive self-supervised learning (CSSL) approach for the text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the same utterance while maximizing the distance between those from different utterances. We propose a channel-invariant loss to prevent the network from encoding undesired channel information into the speaker representation. Bearing these in mind, we conduct intensive experiments on the VoxCeleb1&2 datasets. The self-supervised thin-ResNet34 fine-tuned with only 5% of the labeled data can achieve performance comparable to the fully supervised model, which can save a large amount of manual annotation.
Published: 2021
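
Note: the core principle stated in the abstract above (pull together segments from the same utterance, push apart segments from different utterances) can be illustrated with a small InfoNCE-style loss. This is a minimal sketch under assumptions: the cosine similarity, the temperature value, and the function names are choices made for this example, and the paper's channel-invariant loss is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchors, positives, temperature=0.1):
    """anchors[i] and positives[i] are embeddings of two augmented segments
    cut from utterance i; positives[j] with j != i act as negatives."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        sims = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        m = max(sims)
        # Log-sum-exp over all candidates, computed stably.
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denominator - sims[i]
    return loss / n
```

The loss is small when each anchor is most similar to the segment from its own utterance and large when it is more similar to a segment from another utterance.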

33. CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

Author: Jie Chen, Can Zhang, Meng Cao, Yuexian Zou, and Dongming Yang
Subjects: FOS: Computer and information sciences, COLA (software architecture), Relation (database), business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Interval temporal logic, Feature vector, Computer Science - Computer Vision and Pattern Recognition, Snippet, computer.software_genre, Pattern recognition (psychology), Feature (machine learning), Frame (artificial intelligence), Artificial intelligence, business, computer, Natural language processing
Abstract: Weakly-supervised temporal action localization (WS-TAL) aims to localize actions in untrimmed videos with only video-level labels. Most existing models follow the "localization by classification" procedure: locate temporal regions contributing most to the video-level classification. Generally, they process each snippet (or frame) individually and thus overlook the fruitful temporal context relation. Here arises the single snippet cheating issue: "hard" snippets are too vague to be classified. In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short. Specifically, we propose a Snippet Contrast (SniCo) Loss to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid the temporal interval interruption. Besides, since it is infeasible to access frame-level annotations, we introduce a Hard Snippet Mining algorithm to locate the potential hard snippets. Substantial analyses verify that this mining strategy efficaciously captures the hard snippets and SniCo Loss leads to more informative feature representation. Extensive experiments show that CoLA achieves state-of-the-art results on THUMOS'14 and ActivityNet v1.2 datasets. CoLA code is publicly available at https://github.com/zhang-can/CoLA., accepted by CVPR 2021, typos corrected, code link added
Published: 2021

34. Improved Blind Timing Skew Estimation Based on Spectrum Sparsity and ApFFT in Time-Interleaved ADCs

Author: Yuexian Zou, Ning Lyu, Sujuan Liu, and Jiashuai Cui
Subjects: Hardware architecture, Computer science, 020208 electrical & electronic engineering, Fast Fourier transform, Spectral density estimation, 02 engineering and technology, Chip, Gate array, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Field-programmable gate array, Instrumentation, Algorithm, Communication channel
Abstract: Timing skews among channels seriously degrade the performance of time-interleaved analog-to-digital converters (TIADCs), which can be improved by the blind timing skew estimation (TSE) technique. In this paper, we propose a blind TSE algorithm based on the all-phase fast Fourier transform (ApFFT) and the spectrum sparsity signal phase relationship, termed ApFFT-SSPR-BLTSE. Compared with the existing spectrum sparsity blind TSE (SS-BLTSE) algorithm, the ApFFT-SSPR-BLTSE algorithm reduces computational complexity by exploiting the phase relationship between the total TIADC output and the corresponding reference channel output. We also utilize the ApFFT technique to increase the accuracy of phase spectral estimation. Simulation results show that the proposed ApFFT-SSPR-BLTSE algorithm, which has a reduced number of fast Fourier transforms (FFTs) and low hardware complexity, has higher accuracy for blind TSE than the existing SS-BLTSE algorithm. In addition, this paper presents an efficient hardware architecture of the ApFFT-SSPR-BLTSE algorithm on the Xilinx Virtex-6 vlx550tff1759 field-programmable gate array (FPGA) chip for the blind TSE of a real four-channel 400-MHz 14-bit TIADC system. The validation results show that the proposed algorithm uses only a few percent of the hardware resources of the FPGA chip, and the mismatch spurs were suppressed to better than −81.54 dB.
Published: 2019

35. GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection

Author: Yuexian Zou, Chao Liu, Dongming Yang, and Meng Cao
Subjects: General Computer Science, Computer science, multi-oriented text, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, 0202 electrical engineering, electronic engineering, information engineering, General Materials Science, Segmentation, Layer (object-oriented design), 0105 earth and related environmental sciences, segmentation network, Scene text detection, business.industry, General Engineering, Process (computing), Pattern recognition, Object detection, Task (computing), Feature (computer vision), Salient, 020201 artificial intelligence & image processing, contextual attention, lcsh:Electrical engineering. Electronics. Nuclear engineering, Deconvolution, Artificial intelligence, business, lcsh:TK1-9971, gradient vanishing/exploding problems
Abstract: Scene text detection (STD) is an irreplaceable step in a scene text reading system. It remains a more challenging task than general object detection since text objects have arbitrary orientations and varying sizes. Generally, segmentation methods that use U-Net or hourglass-like networks are the mainstream approaches in multi-oriented text detection tasks. However, experience has shown that text-like objects in complex backgrounds have high response values on the output feature map of U-Net, which leads to a severe false-positive detection rate and degrades STD performance. To tackle this issue, an adaptive soft attention mechanism called the contextual attention module (CAM) is devised and integrated into U-Net to highlight salient areas while retaining more detailed information. Besides, the gradient vanishing and exploding problems make U-Net more difficult to train because of the nonlinear deconvolution layer used in the up-sampling process. To facilitate the training process, a gradient-inductive module (GIM) is designed to provide a linear bypass that makes the gradient back-propagation process more stable. Accordingly, an end-to-end trainable Gradient-Inductive Segmentation network with Contextual Attention (GISCA) is proposed. The experimental results on three public benchmarks demonstrate that the proposed GISCA achieves state-of-the-art results in terms of f-measure: 92.1%, 87.3%, and 81.4% for ICDAR 2013, ICDAR 2015, and MSRA TD500, respectively.
Published: 2019

36. Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Author: Yuexian Zou and Bang Yang
Subjects: Closed captioning, Structure (mathematical logic), business.industry, Computer science, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Task (project management), Human–computer interaction, Encoding (memory), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Layer (object-oriented design), Joint (audio engineering), business, Encoder, Natural language, 0105 earth and related environmental sciences
Abstract: Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent research has shown that exploiting the intrinsic multi-modalities of videos significantly improves captioning performance. However, how to integrate multi-modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers have proposed learning multimodal features in parallel during the encoding stage. The downside of these methods is that they neglect the interaction among the modalities and their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where the bottom layers extract multi-scale contexts from auxiliary modalities while the top layer generates joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and that our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.
Published: 2021

37. PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Author: Fenglin Liu, Zhiqi Huang, Peilin Zhou, and Yuexian Zou
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer science, business.industry, Speech recognition, Context (language use), 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Recurrent neural network, 020204 information systems, Pattern recognition (psychology), 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Language model, Artificial intelligence, business, Computation and Language (cs.CL), Information exchange, Utterance, 0105 earth and related environmental sciences, Spoken language
Abstract: Spoken Language Understanding (SLU) is an essential part of the spoken dialogue system, which typically consists of intent detection (ID) and slot filling (SF) tasks. Recently, recurrent neural network (RNN) based methods have achieved state-of-the-art results for SLU. In the existing RNN-based approaches, ID and SF tasks are often jointly modeled to utilize the correlation information between them. However, efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF have so far not been well studied. In addition, few studies attempt to capture local context information to enhance the performance of SF. Motivated by these findings, in this paper, a Parallel Interactive Network (PIN) is proposed to model the mutual guidance between ID and SF. Specifically, given an utterance, a Gaussian self-attentive encoder is introduced to generate a context-aware feature embedding of the utterance that is able to capture local context information. Taking the feature embedding of the utterance, a Slot2Intent module and an Intent2Slot module are developed to capture the bidirectional information flow for the ID and SF tasks. Finally, a cooperation mechanism is constructed to fuse the information obtained from the Slot2Intent and Intent2Slot modules to further reduce the prediction bias. The experiments on two benchmark datasets, i.e., SNIPS and ATIS, demonstrate the effectiveness of our approach, which achieves competitive results compared with state-of-the-art models. More encouragingly, by using the feature embedding of the utterance generated by the pre-trained language model BERT, our method achieves state-of-the-art results among all compared approaches.
Published: 2021

38. RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection

Author: Dongming Yang, Yuexian Zou, Meng Cao, Jie Chen, and Can Zhang
Subjects: Structure (mathematical logic), FOS: Computer and information sciences, Parsing, Relation (database), Computer science, Computer Vision and Pattern Recognition (cs.CV), Frame (networking), Computer Science - Computer Vision and Pattern Recognition, Inference, Construct (python library), Object (computer science), computer.software_genre, Semantics, Human–computer interaction, computer
Abstract: Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects. The latest end-to-end HOI detectors lack relation reasoning, which leaves them unable to learn HOI-specific interactive semantics for predictions. In this paper, we therefore propose novel relation reasoning for HOI detection. We first present a progressive Relation-aware Frame, which brings a new structure and parameter sharing pattern for interaction inference. Upon the frame, an Interaction Intensifier Module and a Correlation Parsing Module are designed, where: a) interactive semantics from humans can be exploited and passed to objects to intensify interactions, b) interactive correlations among humans, objects and interactions are integrated to promote predictions. Based on the modules above, we construct an end-to-end trainable framework named Relation Reasoning Network (abbr. RR-Net). Extensive experiments show that our proposed RR-Net sets a new state-of-the-art on both the V-COCO and HICO-DET benchmarks and improves on the baseline by about 5.5% and 9.8% in relative terms, validating that this first effort in exploring relation reasoning and integrating interactive semantics has brought clear improvement for end-to-end HOI detection., Comment: 7 pages, 6 figures
Published: 2021

39. Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation

Author: Fenglin Liu, Yuexian Zou, Shen Ge, Xian Wu, and Wei Fan
Subjects: FOS: Computer and information sciences, Medical knowledge, Computer Science - Computation and Language, Information retrieval, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Missed diagnosis, Visualization, Task (project management), Radiology report, Clinical Practice, Pattern recognition (psychology), Graph (abstract data type), Artificial intelligence, business, Computation and Language (cs.CL)
Abstract: Automatically generating radiology reports can improve current clinical practice in diagnostic radiology. On the one hand, it can relieve radiologists of the heavy burden of report writing; on the other hand, it can remind radiologists of abnormalities and help avoid misdiagnosis and missed diagnosis. Yet, this task remains challenging for data-driven neural networks, due to serious visual and textual data biases. To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who first examine the abnormal regions and assign disease topic tags to them, and then rely on years of accumulated prior medical knowledge and working experience to write reports. Thus, the PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD). In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias. The explored knowledge is distilled by the MKD to generate the final reports. Evaluated on the MIMIC-CXR and IU-Xray datasets, our method outperforms previous state-of-the-art models on both datasets., Comment: Accepted by CVPR 2021 (2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2021))
Published: 2021

40. SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification

Author: Helin Wang, Wenwu Wang, and Yuexian Zou
Subjects: Masking (art), FOS: Computer and information sciences, Sound (cs.SD), Computer science, Generalization, business.industry, Pattern recognition, Space (commercial competition), Computer Science - Sound, Discriminative model, Audio and Speech Processing (eess.AS), Feature (machine learning), FOS: Electrical engineering, electronic engineering, information engineering, Deep neural networks, Artificial intelligence, State (computer science), business, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we present SpecAugment++, a novel data augmentation method for deep neural network based acoustic scene classification (ASC). Unlike other popular data augmentation methods such as SpecAugment and mixup, which only work on the input space, SpecAugment++ is applied to both the input space and the hidden space of the deep neural networks to enhance the input and the intermediate feature representations. For an intermediate hidden state, the augmentation techniques consist of masking blocks of frequency channels and masking blocks of time frames, which improve generalization by enabling a model to attend not only to the most discriminative parts of the feature, but also to the feature as a whole. Apart from using zeros for masking, we also examine two approaches for masking based on the use of other samples within the minibatch, which helps introduce noise to the networks to make them more discriminative for classification. The experimental results on the DCASE 2018 Task1 dataset and DCASE 2019 Task1 dataset show that our proposed method obtains 3.6% and 4.7% accuracy gains, respectively, over a strong baseline without augmentation (i.e., CP-ResNet), and outperforms other previous data augmentation methods., Comment: Submitted to Interspeech 2021
Published: 2021
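
Note: the zero-masking variant described in the abstract above (masking a block of frequency channels and a block of time frames in an intermediate hidden state) might look roughly like the sketch below. The list-of-lists layout, the parameter names, and the uniform sampling of block width and position are assumptions made for this example. The minibatch-based masking variants mentioned in the abstract are not shown.

```python
import random

def mask_hidden_state(hidden, max_freq_width, max_time_width, rng=None):
    """Return a copy of `hidden` (rows = frequency channels, columns = time
    frames) with one frequency block and one time block set to zero."""
    rng = rng or random.Random()
    n_freq = len(hidden)
    n_time = len(hidden[0])
    masked = [row[:] for row in hidden]  # leave the input untouched

    # Mask a contiguous block of frequency channels.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        masked[f] = [0.0] * n_time

    # Mask a contiguous block of time frames.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in masked:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return masked
```

In a real network the same idea would be applied to a feature-map tensor during training only, and the block widths would be treated as tunable hyperparameters.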

41. All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection

Author: Yuexian Zou, Dongming Yang, Meng Cao, and Can Zhang
Subjects: FOS: Computer and information sciences, Point (typography), Channel (digital image), business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, Spotting, Pipeline (software), Expression (mathematics), Feature (computer vision), Media Technology, Segmentation, Rectangle, Artificial intelligence, Electrical and Electronic Engineering, business
Abstract: Arbitrary-shaped text detection is a challenging task since curved texts in the wild have complex geometric layouts. Existing mainstream methods follow the instance segmentation pipeline to obtain the text regions. However, arbitrary-shaped texts are difficult to depict with a single segmentation network because of their varying scales. In this paper, we propose a two-stage segmentation-based detector, termed NASK (Need A Second looK), for arbitrary-shaped text detection. Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner, with the first-stage segmentation spotting rectangular text proposals and the second one retrieving compact representations. Specifically, NASK is composed of a Text Instance Segmentation (TIS) network (1st stage), a Geometry-aware Text RoI Alignment (GeoAlign) module, and a Fiducial pOint eXpression (FOX) module (2nd stage). First, TIS extracts augmented features with a novel Group Spatial and Channel Attention (GSCA) module and conducts instance segmentation to obtain rectangular proposals. Then, GeoAlign converts these rectangles to a fixed size and encodes RoI-wise feature representations. Finally, FOX decomposes the text instance into several pivotal geometrical attributes to refine the detection results. Extensive experimental results on three public benchmarks, Total-Text, SCUT-CTW1500, and ICDAR 2015, verify that our NASK outperforms recent state-of-the-art methods., Comment: Accepted by T-CSVT
Published: 2021

42. A Mutual learning framework for Few-shot Sound Event Detection

Author: Dongchao Yang, Helin Wang, Yuexian Zou, Zhongjie Ye, and Wenwu Wang
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Although the prototypical network (ProtoNet) has proved to be an effective method for few-shot sound event detection, two problems still exist. First, the small support set is insufficient, so the class prototypes may not represent the class center accurately. Second, the feature extractor is task-agnostic (or class-agnostic): the feature extractor is trained with base-class data and directly applied to unseen-class data. To address these issues, we present a novel mutual learning framework with transductive learning, which aims at iteratively updating the class prototypes and the feature extractor. More specifically, we propose to update the class prototypes with transductive inference to make them as close to the true class center as possible. To make the feature extractor task-specific, we propose to use the updated class prototypes to fine-tune the feature extractor. After that, the fine-tuned feature extractor further helps produce better class prototypes. Our method achieves an F-score of 38.4% on the DCASE 2021 Task 5 evaluation set, which won first place in the few-shot bioacoustic event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge., Comment: Accepted by ICASSP2022. arXiv admin note: text overlap with arXiv:2106.12252 by other authors
Published: 2021

43. Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification

Author: Dongchao Yang, Helin Wang, and Yuexian Zou
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Discriminator, Relation (database), business.industry, Computer science, Pattern recognition, Computer Science - Sound, Domain (software engineering), Task (project management), Audio and Speech Processing (eess.AS), Face (geometry), FOS: Electrical engineering, electronic engineering, information engineering, Artificial intelligence, Adaptation (computer science), business, Subspace topology, Electrical Engineering and Systems Science - Audio and Speech Processing, Test data
Abstract: It is well known that the mismatch between training (source) and test (target) data distributions significantly decreases the performance of acoustic scene classification (ASC) systems. To address this issue, domain adaptation (DA) is one solution, and many unsupervised DA methods have been proposed. These methods focus on the scenario of a single source domain and a single target domain. In practice, however, test data may come from multiple target domains. This problem can be addressed by producing one model per target domain, but this solution is too costly. In this paper, we propose a novel unsupervised multi-target domain adaptation (MTDA) method for ASC, which can adapt to multiple target domains simultaneously and make use of the underlying relation among multiple domains. Specifically, our approach combines traditional adversarial adaptation with two novel discriminator tasks that learn a common subspace shared by all domains. Furthermore, we propose to divide the target domains into easy-to-adapt and hard-to-adapt domains, which enables the system to pay more attention to the hard-to-adapt domains in training. The experimental results on the DCASE 2020 Task 1-A dataset and the DCASE 2019 Task 1-B dataset show that our proposed method significantly outperforms previous unsupervised DA methods., Comment: 5 pages, 4 figures, submitted to Interspeech 2021
Published: 2021

44. SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection

Author: Yuexian Zou, Can Zhang, and Ranyu Ning
Subjects: FOS: Computer and information sciences, business.industry, Generalization, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, Convolution, Action (philosophy), Receptive field, Feature (computer vision), Artificial intelligence, business, Set (psychology), Scale (map), Block (data storage)
Abstract: Temporal action detection (TAD) is a challenging task which aims to temporally localize and recognize human actions in untrimmed videos. Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale of action instances are set by designers. Such anchor-based TAD methods have limited generalization capability, which leads to performance degradation when videos contain rich action variation. In this study, we explore removing the requirement of pre-defined anchors for TAD methods. A novel TAD model termed Selective Receptive Field Network (SRF-Net) is developed, in which the location offsets and classification scores at each temporal location can be directly estimated in the feature map, and SRF-Net is trained in an end-to-end manner. A building block called Selective Receptive Field Convolution (SRFC) is designed, which is able to adaptively adjust its receptive field size according to multiple scales of input information at each temporal location in the feature map. Extensive experiments are conducted on the THUMOS14 dataset, and superior results are reported compared with state-of-the-art TAD approaches., Comment: Accepted by ICASSP 2021
Published: 2021

45. A Global-local Attention Framework for Weakly Labelled Audio Tagging

Author: Wenwu Wang, Yuexian Zou, and Helin Wang
Subjects: FOS: Computer and information sciences, Signal processing, Network architecture, Sound (cs.SD), Offset (computer science), Exploit, Computer science, Speech recognition, Pooling, Speech processing, Computer Science - Sound, Audio and Speech Processing (eess.AS), Selection (linguistics), FOS: Electrical engineering, electronic engineering, information engineering, Baseline (configuration management), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework, and exploited the information of the whole audio clip by MIL pooling functions. However, the detailed information of sound events such as their durations may not be considered under this framework. To address this issue, we propose a novel two-stream framework for audio tagging by exploiting the global and local information of sound events. The global stream aims to analyze the whole audio clip in order to capture the local clips that need to be attended using a class-wise selection module. These clips are then fed to the local stream to exploit the detailed information for a better decision. Experimental results on the AudioSet show that our proposed method can significantly improve the performance of audio tagging under different baseline network architectures., Comment: Accepted to ICASSP2021
Published: 2021
Full Text: View/download PDF

46. Deep Speaker Embedding with Long Short Term Centroid Learning for Text-Independent Speaker Verification

Author: Yuexian Zou, Junyi Peng, and Rongzhi Gu
Subjects: Speaker verification, Computer science, Speech recognition, Text independent, Embedding, Centroid, Term (time)
Published: 2020

47. Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging

Author: Yuexian Zou, Wenwu Wang, and Sixin Hong
Subjects: Computer science, Head (linguistics), Pooling, Computer vision, Artificial intelligence
Published: 2020

48. Bridging the Gap between Vision and Language Domains for Improved Image Captioning

Author: Xian Wu, Fenglin Liu, Yuexian Zou, Wei Fan, Xiaoyu Zhang, and Shen Ge
Subjects: Closed captioning, Information retrieval, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, 010501 environmental sciences, Semantics, 01 natural sciences, Bridging (programming), Domain (software engineering), Image (mathematics), Range (mathematics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Plug-in, Encoder, 0105 earth and related environmental sciences
Abstract: Image captioning has attracted extensive research interest in recent years. Due to the great disparities between vision and language, an important goal of image captioning is to link the information in the visual domain to the textual domain. However, many approaches conduct this process only in the decoder, making it hard to understand the images and generate captions effectively. In this paper, we propose to bridge the gap between the vision and language domains in the encoder, by enriching visual information with textual concepts, to achieve deep image understanding. To this end, we propose to explore textual-enriched image features. Specifically, we introduce two modules, namely the Textual Distilling Module and the Textual Association Module. The former distills relevant textual concepts from image features, while the latter further associates the extracted concepts according to their semantics. In this manner, we acquire textual-enriched image features, which provide clear textual representations of the image under no explicit supervision. The proposed approach can be used as a plugin and easily embedded into a wide range of existing image captioning systems. We conduct extensive experiments on two benchmark image captioning datasets, i.e., MSCOCO and Flickr30k. The experimental results and analysis show that, by incorporating the proposed approach, all baseline models receive consistent improvements over all metrics, with the most significant improvements up to 10% and 9% in terms of the task-specific metrics CIDEr and SPICE, respectively. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image captioning.
Published: 2020

49. Cluster Attention Contrast for Video Anomaly Detection

Author: Ziming Wang, Zeming Zhang, and Yuexian Zou
Subjects: Computer science, Contrast (statistics), Pattern recognition, Snippet, Feature (computer vision), Anomaly detection, Artificial intelligence, Representation (mathematics), Projection (set theory), Feature learning, Normality
Abstract: Anomaly detection in videos is commonly referred to as the discrimination of events that do not conform to expected behaviors. Most existing methods formulate video anomaly detection as an outlier detection task and establish a normal concept by minimizing reconstruction loss or prediction loss on training data. However, the performance of these methods drops when they cannot guarantee either higher reconstruction errors for abnormal events or lower prediction errors for normal events. To avoid these problems, we introduce a novel contrastive representation learning task, Cluster Attention Contrast, to establish subcategories of normality as clusters. Specifically, we employ multi-parallel projection layers to project snippet-level video features into multiple discriminative feature spaces. Each of these feature spaces corresponds to a cluster that captures a distinct subcategory of normality. To acquire reliable subcategories, we propose the Cluster Attention Module to draw the cluster attention representation of each snippet, and then maximize the agreement of the representations from the same snippet under random data augmentations via momentum contrast. In this manner, we establish a robust normal concept without any prior assumptions on reconstruction errors or prediction errors. Experiments show that our approach achieves state-of-the-art performance on benchmark datasets.
Published: 2020

50. ABC-NET: Avoiding Blocking Effect & Color Shift Network for Single Image Dehazing Via Restraining Transmission Bias

Author: Yuexian Zou, Cong Wang, and Zehan Chen
Subjects: Transmission (telecommunications), Method comparison, Computer science, Activation function, Blocking effect, Color shift, Single image, Negative bias, Algorithm, Block effect
Abstract: In recent years, single image dehazing methods based on the Atmospheric Scattering Model (ASM) have achieved state-of-the-art results. However, the dehazing outputs of those methods suffer from color shift and blocking effect. Our preliminary experiments show that the negative bias of the estimated transmission and the bias of tiny transmission values cause serious color shift. Therefore, in this study, a new loss function (TransLoss) and a new natural activation function (NAF) are proposed to restrain the negative bias of transmission and to prevent tiny transmission values from being activated, respectively. Moreover, it is noted that the blocking effect is caused by the patch-level transmission estimation mechanism in existing dehazing models. To address this issue, a new pixel-level transmission estimation module (ETM) is specifically designed to avoid the blocking effect. Finally, an end-to-end CNN dehazing network that avoids color shift and blocking effect is developed, termed ABC-Net. Experimental results indicate that ABC-Net outperforms four comparison methods on both synthetic and real-world images.
Published: 2020

185 results on '"Yuexian Zou"'
