Author: "Xu, Kele" / Topic: computer science - sound - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xu, Kele"' showing total 8 results

1. Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association

Author: Chen, Wuyang, Sun, Yanjie, Xu, Kele, and Dou, Yong
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The innate correlation between a person's face and voice has recently emerged as a compelling area of study, especially within the context of multilingual environments. This paper introduces our novel solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge, focusing on a contrastive learning-based chaining-cluster method to enhance face-voice association. This task involves the challenges of building biometric relations between auditory and visual modality cues and modelling the prosody interdependence between different languages while addressing both intrinsic and extrinsic variability present in the data. To handle these non-trivial challenges, our method employs supervised cross-contrastive (SCC) learning to establish robust associations between voices and faces in multi-language scenarios. Following this, we have specifically designed a chaining-cluster-based post-processing step to mitigate the impact of outliers often found in unconstrained in-the-wild data. We conducted extensive experiments to investigate the impact of language on face-voice association. The overall results were evaluated on the FAME public evaluation platform, where we achieved 2nd place. The results demonstrate the superior performance of our method, and we validate the robustness and effectiveness of our proposed approach. Code is available at https://github.com/colaudiolab/FAME24_solution.
Published: 2024

2. Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

Author: Zhu, Boqing, Xu, Kele, Wang, Changjian, Qin, Zheng, Sun, Tao, Wang, Huaimin, and Peng, Yuxing
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present an approach to learn voice-face representations from talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed based on the natural correlation between audio clips and visual frames. However, this correlation might be weak or inaccurate in a large amount of real-world data, which introduces deviating positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists the adverse effects of false negatives and deviating positives. On the one hand, CMPC can learn intra-class invariance by constructing semantic-wise positives via unsupervised clustering in different modalities. On the other hand, by comparing the similarities of cross-modal instances with those of cross-modal prototypes, we dynamically recalibrate the unlearnable instances' contribution to the overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also shows a significant improvement compared to previous instance-wise contrastive learning. Comment: 8 pages, 4 figures. Accepted by IJCAI-2022
Published: 2022

3. Quantification of Transducer Misalignment in Ultrasound Tongue Imaging

Author: Csapó, Tamás Gábor and Xu, Kele
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In speech production research, different imaging modalities have been employed to obtain accurate information about the movement and shaping of the vocal tract. Ultrasound is an affordable and non-invasive imaging modality with relatively high temporal and spatial resolution to study the dynamic behavior of the tongue during speech production. However, a long-standing problem for ultrasound tongue imaging is transducer misalignment during longer data recording sessions. In this paper, we propose a simple, yet effective, misalignment quantification approach. The analysis employs Mean Square Error (MSE) distance and two similarity metrics to identify the relative displacement between the chin and the transducer. We visualize these measures as a function of the timestamp of the utterances. Extensive experiments are conducted on a Hungarian and a Scottish English child dataset. The results suggest that large values of MSE and small values of Structural Similarity Index (SSIM) and Complex Wavelet SSIM indicate corruptions or issues during the data recordings, which can be caused either by transducer misalignment or by lack of gel. Comment: 5 pages, accepted for publication at Interspeech 2020
Published: 2020
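
Note: the MSE measure mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation; it assumes each ultrasound frame is a small 2-D grid of intensities and that every frame is compared against the first frame of the session, a reference choice the abstract does not specify. The function names `mse` and `mse_over_time` are hypothetical.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two equally sized 2-D frames (lists of rows)."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def mse_over_time(frames, reference_index=0):
    """MSE of every frame against one reference frame (assumed: the first one)."""
    reference = frames[reference_index]
    return [mse(reference, frame) for frame in frames]


# Toy 2x2 "frames": the third one differs from the reference by 2 in every pixel.
toy_frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[2, 2], [2, 2]],
]
print(mse_over_time(toy_frames))  # [0.0, 0.0, 4.0]
```

Plotted against utterance timestamps, a sustained jump in such a curve would be the kind of pattern the abstract associates with a recording problem.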

4. Audio Tagging by Cross Filtering Noisy Labels

Author: Zhu, Boqing, Xu, Kele, Kong, Qiuqiang, Wang, Huaimin, and Peng, Yuxing
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Statistics - Machine Learning
Abstract: High-quality labeled datasets have allowed deep learning to achieve impressive results on many sound analysis tasks. Yet, it is labor-intensive to accurately annotate large amounts of audio data, and the dataset may contain noisy labels in practical settings. Meanwhile, deep neural networks are susceptible to such incorrectly labeled data because of their outstanding memorization ability. In this paper, we present a novel framework, named CrossFilter, to combat the noisy labels problem for audio tagging. Multiple representations (such as Logmel and MFCC) are used as the input of our framework to provide more complementary information about the audio. Then, through the cooperation and interaction of two neural networks, we divide the dataset into curated and noisy subsets by incrementally picking out the possibly correctly labeled data from the noisy data. Moreover, our approach leverages multi-task learning on the curated and noisy subsets with different loss functions to fully utilize the entire dataset. A noise-robust loss function is employed to alleviate the adverse effects of incorrect labels. On both audio tagging datasets, FSDKaggle2018 and FSDKaggle2019, empirical results demonstrate a performance improvement compared with other competing approaches. On the FSDKaggle2018 dataset, our method achieves state-of-the-art performance and even surpasses the ensemble models. Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing
Published: 2020

5. Multi-Representation Knowledge Distillation For Audio Classification

Author: Gao, Liang, Xu, Kele, Wang, Huaimin, and Peng, Yuxing
Subjects: Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: As an important component of multimedia analysis tasks, audio classification aims to discriminate between different audio signal types and has received intensive attention due to its wide applications. Generally speaking, the raw signal can be transformed into various representations (such as Short Time Fourier Transform and Mel Frequency Cepstral Coefficients), and information implied in different representations can be complementary. Ensembling the models trained on different representations can greatly boost the classification performance; however, making inference using a large number of models is cumbersome and computationally expensive. In this paper, we propose a novel end-to-end collaborative learning framework for the audio classification task. The framework takes multiple representations as the input to train the models in parallel. The complementary information provided by different representations is shared by knowledge distillation. Consequently, the performance of each model can be significantly improved without increasing the computational overhead in the inference stage. Extensive experimental results demonstrate that the proposed approach can improve the classification performance and achieve state-of-the-art results on both acoustic scene classification tasks and general audio tagging tasks.
Published: 2020

6. Weakly supervised CRNN system for sound event detection with large-scale unlabeled in-domain data

Author: Wang, Dezhi, Zhang, Lilun, Bao, Changchun, Xu, Kele, Zhu, Boqing, and Kong, Qiuqiang
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sound event detection (SED) is typically posed as a supervised learning problem requiring training data with strong temporal labels of sound events. However, the production of datasets with strong labels normally incurs a prohibitive labor cost, which limits the practical application of supervised SED methods. Recent advances in SED approaches focus on detecting sound events by taking advantage of weakly labeled or unlabeled training data. In this paper, we propose a joint framework to solve the SED task using large-scale unlabeled in-domain data. In particular, a state-of-the-art general audio tagging model is first employed to predict weak labels for unlabeled data. Then, a weakly supervised architecture based on the convolutional recurrent neural network (CRNN) is developed to produce the strong annotations of sound events with the aid of the unlabeled data with predicted labels. It is found that the SED performance generally increases as more unlabeled data is added to the training. To address the noisy label problem of unlabeled data, an ensemble strategy is applied to increase the system robustness. The proposed system is evaluated on the SED dataset of the DCASE 2018 challenge. It reaches an F1-score of 21.0%, resulting in an improvement of 10% over the baseline system. Comment: Submitted to ICASSP 2019
Published: 2018

7. Sample Mixed-Based Data Augmentation for Domestic Audio Tagging

Author: Wei, Shengyun, Xu, Kele, Wang, Dezhi, Liao, Feifan, Wang, Huaimin, and Kong, Qiuqiang
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio tagging has attracted increasing attention over the last decade and has various potential applications in many fields. The objective of audio tagging is to predict the labels of an audio clip. Recently, deep learning methods have been applied to audio tagging and have achieved state-of-the-art performance. However, due to the limited size of audio tagging data such as DCASE data, the trained models tend to overfit, which leads to poor generalization on new data. Previous data augmentation methods such as pitch shifting, time stretching and adding background noise do not show much improvement in audio tagging. In this paper, we explore sample mixed data augmentation for the domestic audio tagging task, including mixup, SamplePairing and extrapolation. We apply a convolutional recurrent neural network (CRNN) with an attention module and log-scaled mel spectrum input as a baseline system. In our experiments, we achieve a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 task4 dataset with the mixup approach, outperforming the baseline system without data augmentation. Comment: submitted to the workshop of Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018), 19-20 November 2018, Surrey, UK
Published: 2018
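
Note: the mixup augmentation named in the abstract above can be illustrated with a minimal sketch of the general technique. It is not the authors' code; it assumes features and multi-hot labels are plain Python lists, and the function name `mixup` and the default `alpha=0.2` are hypothetical choices made for illustration. Mixup blends two training examples and their labels with a weight drawn from a Beta(alpha, alpha) distribution.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (features, labels) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    # Convex combination of the inputs and of the label vectors.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy clips with 3 feature values each and 2 possible tags (assumed shapes).
rng = random.Random(0)  # seeded only to make the demo repeatable
mixed_x, mixed_y, lam = mixup([1.0, 2.0, 3.0], [1.0, 0.0],
                              [3.0, 2.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, mixed_x, mixed_y)
```

Because the labels are blended with the same weight as the features, the mixed label vector stays a valid soft target for a tagging loss.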

8. Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features

Author: Zhu, Boqing, Xu, Kele, Wang, Dezhi, Zhang, Lilun, Li, Bo, and Peng, Yuxing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Motivated by the fact that characteristics of different sound classes are highly diverse in different temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for the environmental sound classification task. This network architecture takes raw waveforms as input, and a set of separate parallel CNNs is utilized with different convolutional filter sizes and strides, in order to learn feature representations with multi-temporal resolutions. On the other hand, the proposed architecture also aggregates hierarchical features from multi-level CNN layers for classification using direct connections between convolutional layers, which goes beyond the typical single-level CNN features employed by the majority of previous studies. This network architecture also improves the flow of information and avoids the vanishing gradient problem. The combination of multi-level features boosts the classification performance significantly. Comparative experiments are conducted on two datasets: the environmental sound classification dataset (ESC-50) and the DCASE 2017 audio scene classification dataset. Results demonstrate that the proposed method is highly effective in the classification tasks by employing multi-temporal resolution and multi-level features, and it outperforms the previous methods which only account for single-level features. Comment: Submitted to PCM 2018
Published: 2018
