Automated Generation of Human-readable Natural Arabic Text from RDF Data
- Source :
- ACM Transactions on Asian and Low-Resource Language Information Processing. 22:1-13
- Publication Year :
- 2023
- Publisher :
- Association for Computing Machinery (ACM), 2023.
Abstract
- With the advances in Natural Language Processing (NLP), the industry has been moving towards human-directed artificial intelligence (AI) solutions. Recently, chatbots and automated news generation have captured a lot of attention. The goal is to automatically generate readable text from tabular data or web data commonly represented in the Resource Description Framework (RDF) format. The problem can then be formulated as Data-to-Text (D2T) generation: producing human-readable natural language from structured non-linguistic data. Despite the significant work done for the English language, little effort has been directed towards low-resource languages such as Arabic. This work promotes the development of the first RDF data-to-text (D2T) generation system for the Arabic language while trying to address the low-resource limitation. We develop several models for the Arabic D2T task using transfer learning from large language models (LLMs) such as AraBERT, AraGPT2, and mT5. These models include a baseline Bi-LSTM Sequence-to-Sequence (Seq2Seq) model, as well as encoder-decoder transformers such as BERT2BERT, BERT2GPT, and T5. We then provide a detailed comparative study highlighting the strengths and limitations of these methods, setting the stage for further advancement in the field. We also introduce a new Arabic dataset (AraWebNLG) that can be used for new model development in the field. To ensure a comprehensive evaluation, general-purpose automated metrics (BLEU and perplexity scores) are used, as well as task-specific human evaluation metrics related to the accuracy of content selection and the fluency of the generated text. The results highlight the importance of pre-training on a large corpus of Arabic data and show that transfer learning from AraBERT gives the best performance. Text-to-text pre-training using mT5 achieves the second-best performance even with multilingual weights.
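- The abstract mentions warm-starting encoder-decoder transformers (e.g., BERT2BERT) from pre-trained Arabic checkpoints. The sketch below is a minimal illustration, not the authors' code, of how such a model could be assembled with the Hugging Face Transformers library; the checkpoint name, the RDF-triple linearization, and the decoding settings are assumptions for demonstration only.

```python
# Minimal sketch: building a BERT2BERT encoder-decoder warm-started from AraBERT
# for RDF-to-Arabic-text generation. Checkpoint name and input format are assumed.
from transformers import AutoTokenizer, EncoderDecoderModel

arabert = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint on the Hub

tokenizer = AutoTokenizer.from_pretrained(arabert)
# Initialize both encoder and decoder weights from the same AraBERT checkpoint.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(arabert, arabert)

# Wire the generation config to the tokenizer's special tokens.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Linearize one RDF triple (subject | predicate | object) as the encoder input.
rdf_input = "<subject> دمشق <predicate> عاصمة <object> سوريا"
inputs = tokenizer(rdf_input, return_tensors="pt")

# After fine-tuning on an Arabic D2T corpus such as AraWebNLG, generation
# would verbalize the triple as a fluent Arabic sentence.
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=64,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```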
- Subjects :
- General Computer Science
Details
- ISSN :
- 2375-4702 and 2375-4699
- Volume :
- 22
- Database :
- OpenAIRE
- Journal :
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Accession number :
- edsair.doi...........3234b4f6a49546741aa3069089e1072b
- Full Text :
- https://doi.org/10.1145/3582262