
HAAC: Hierarchical audio augmentation chain for ACCDOA described sound event localization and detection.

Authors :
Wu, Shichao
Wang, Yongru
Hu, Zhengxi
Liu, Jingtai
Source :
Applied Acoustics. Aug2023, Vol. 211, pN.PAG-N.PAG. 1p.
Publication Year :
2023

Abstract

• Propose a hierarchical audio augmentation chain (HAAC) for ACCDOA-represented SELD.
• Generate simulated audio mixtures for SELD and disclose all synthesis details.
• Conduct SELD experiments with two baseline systems on two benchmark datasets.
• Both HAAC and additional synthesized audio help improve SELD performance.

The goal of sound event localization and detection (SELD) is to detect the temporal activity of a known set of sound events and to localize them in space. We argue that a large audio dataset is essential for a deep neural network-based SELD system trained as a supervised task. However, gathering and annotating such datasets is costly and time-intensive. Hence, data augmentation methods have attracted attention as a way to increase sample diversity from limited collections. In this paper, we augment the limited audio samples for deep neural network-based SELD in two ways. First, we propose the hierarchical audio augmentation chain (HAAC) for SELD described by the activity-coupled Cartesian direction of arrival (ACCDOA) output representation. It consists of three waveform and spectrogram augmentation techniques, assembled in order from feature map augmentation to audio channel swapping and finally sample mixup. Second, we augment the training samples by generating more simulated audio samples, and we make the list of selected sound events publicly available to the community. Experiments on the STARSS22 dataset showed that HAAC greatly improved SELD performance, increasing the sound event detection score by 24% and decreasing the localization error by 12.1°. We demonstrate that it is a simple yet effective approach compared to other data augmentation methods. Moreover, training with more simulated audio samples, generated by convolving selected sound events with spatial room impulse responses (SRIRs), greatly improved SELD performance. [ABSTRACT FROM AUTHOR]
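
As a rough illustration of two operations named in the abstract, the sketch below shows SRIR convolution and sample mixup in NumPy. It is not the authors' implementation: the function names `spatialize` and `mixup`, the array shapes, and the choice to blend ACCDOA targets linearly with the same weight as the inputs are all assumptions made for this example.

```python
import numpy as np


def spatialize(event, srir):
    """Convolve a mono sound event with a multichannel SRIR.

    event: shape (n_samples,), a dry mono recording (assumed layout).
    srir:  shape (n_channels, n_taps), one impulse response per channel.
    Returns an array of shape (n_channels, n_samples + n_taps - 1).
    """
    return np.stack([np.convolve(event, h) for h in srir])


def mixup(x1, y1, x2, y2, lam):
    """Blend two (input, target) pairs with weight lam in [0, 1].

    Assumption: the ACCDOA targets (Cartesian DOA vectors whose length
    encodes activity) are mixed linearly with the same weight as the inputs.
    """
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


# Toy usage: a 3-sample event rendered through a 2-channel, 2-tap SRIR.
event = np.array([1.0, 0.0, 0.0])
srir = np.array([[1.0, 0.5], [0.0, 1.0]])
mixture = spatialize(event, srir)
```

In practice the mixup weight is often drawn from a Beta distribution; a fixed `lam` is used here only to keep the sketch deterministic.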

Details

Language :
English
ISSN :
0003682X
Volume :
211
Database :
Academic Search Index
Journal :
Applied Acoustics
Publication Type :
Academic Journal
Accession number :
170745387
Full Text :
https://doi.org/10.1016/j.apacoust.2023.109541