1. SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models
- Author
Gontier, Félix; Serizel, Romain; Cerisara, Christophe (Université de Lorraine, CNRS, Inria, LORIA; MULTISPEECH and SYNALP teams). Funding: ANR-18-CE23-0020, LEAUDS (AAPG 2018).
- Subjects
Audio captioning, Evaluation, DCASE, [INFO] Computer Science [cs]
- Abstract
Audio captioning aims at describing acoustic scenes with natural language. Systems are currently evaluated with the image captioning metrics CIDEr and SPICE. However, recent studies have highlighted a poor correlation of these metrics with human assessments. In this paper, we propose SPICE+, a modification of SPICE that improves caption annotation and comparison with pre-trained language models. The metric parses captions into semantic graphs with a deep dependency annotation model and a refined set of linguistic rules, then compares sentence embeddings of candidate and reference semantic elements. We formulate a score for general-purpose captioning evaluation that can be tailored to more specific applications. Combined with fluency error detection, the metric achieves competitive performance on the FENSE benchmark, with 84.0% accuracy on AudioCaps and 74.1% on Clotho. Further experiments show that the metric behaves similarly to the full sentence embedding similarity, while the decomposition into semantic elements allows better interpretability of scores and can provide additional information on the properties of captioning systems.
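The comparison step the abstract describes (matching candidate semantic elements against reference elements via embedding similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vectors stand in for embeddings that SPICE+ would obtain from a pre-trained sentence encoder, and `element_similarity` is a hypothetical name.

```python
# Hedged sketch: score candidate semantic elements by their best cosine
# match against reference elements, then average. The 2-D vectors below
# are toy stand-ins for real sentence-encoder embeddings.
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def element_similarity(candidate_embs, reference_embs):
    # For each candidate element, take its best match among the
    # reference elements; average these best-match scores.
    if not candidate_embs or not reference_embs:
        return 0.0
    return sum(
        max(cosine(c, r) for r in reference_embs) for c in candidate_embs
    ) / len(candidate_embs)

# Toy example: two candidate elements vs. two reference elements.
cands = [[1.0, 0.0], [0.0, 1.0]]
refs = [[1.0, 0.0], [0.7, 0.7]]
print(round(element_similarity(cands, refs), 3))  # → 0.854
```

Matching at the level of semantic elements, rather than whole sentences, is what gives the score its interpretability: a low similarity can be traced back to the specific element that found no good match.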
- Published
- 2023