PublicHearingBR: A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents

Authors :
Fernandes, Leandro Carísio
Dobins, Guilherme Zeferino Rodrigues
Lotufo, Roberto
Pereira, Jayr Alencar
Publication Year :
2024

Abstract

This paper introduces PublicHearingBR, a Brazilian Portuguese dataset designed for summarizing long documents. The dataset consists of transcripts of public hearings held by the Brazilian Chamber of Deputies, paired with news articles and structured summaries containing the individuals participating in the hearing and their statements or opinions. The dataset supports the development and evaluation of long document summarization systems in Portuguese. Our contributions include the dataset, a hybrid summarization system to establish a baseline for future studies, and a discussion on evaluation metrics for summarization involving large language models, addressing the challenge of hallucination in the generated summaries. As a result of this discussion, the dataset also provides annotated data that can be used in Natural Language Inference tasks in Portuguese.

Comment: 26 pages

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2410.07495
Document Type :
Working Paper