MegaWika: Millions of reports and their sources across 50 diverse languages

Authors :
Barham, Samuel
Weller, Orion
Yuan, Michelle
Murray, Kenton
Yarmohammadi, Mahsa
Jiang, Zhengping
Vashishtha, Siddharth
Martin, Alexander
Liu, Anqi
White, Aaron Steven
Boyd-Graber, Jordan
Van Durme, Benjamin
Publication Year :
2023

Abstract

To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

Comment: Submitted to ACL, 2023

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2307.07049
Document Type :
Working Paper