Urdu Short Paraphrase Detection at Sentence Level

Authors :: Hamza Hafeez
Iqra Muneer
Muhammad Sharjeel
Muhammad Adnan Ashraf
Rao Muhammad Adeel Nawab
Source :: ACM Transactions on Asian and Low-Resource Language Information Processing. 22:1-20
Publication Year :: 2023
Publisher :: Association for Computing Machinery (ACM), 2023.
Abstract: Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previous research has mainly focused on developing paraphrase detection resources for the English language. Very few efforts have addressed paraphrase detection in South Asian languages, and no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language. This is mainly due to the unavailability of corpora that focus on the sentence level. The available related studies on the Urdu language focus only on text reuse detection tasks at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark Urdu paraphrase detection corpus at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. Moreover, as a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique for our research. The experimental results indicate that our proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, the performance increases when features of the proposed (ST) and baseline (N-gram) techniques are combined for the classification task. In addition, the proposed techniques have also been applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). Our corpus is available and free to download for research purposes.
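
The abstract does not spell out how the N-gram and ST features are computed or combined, so the following is only a minimal sketch of what a feature-fusion step of this kind might look like. It assumes Jaccard overlap of word n-grams as the lexical feature and cosine similarity of two precomputed sentence embeddings as the semantic feature; the function names, the whitespace tokenisation, and the choice of n = 1..3 are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt


def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(sent_a, sent_b, n):
    """Jaccard overlap of word n-grams for two whitespace-tokenised sentences."""
    grams_a = word_ngrams(sent_a.split(), n)
    grams_b = word_ngrams(sent_b.split(), n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def fused_features(sent_a, sent_b, emb_a, emb_b, max_n=3):
    """Hypothetical fusion: n-gram overlaps (n = 1..max_n) plus embedding cosine.

    emb_a and emb_b stand in for sentence embeddings produced elsewhere
    (for example by a sentence-transformer model); they are assumed inputs.
    """
    lexical = [ngram_overlap(sent_a, sent_b, n) for n in range(1, max_n + 1)]
    return lexical + [cosine(emb_a, emb_b)]
```

The resulting feature vector for each sentence pair could then be passed to any standard classifier that labels the pair as paraphrased or non-paraphrased; which classifier the study used is not stated in this record.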

Subjects :: General Computer Science

Details

ISSN :: 2375-4702 and 2375-4699
Volume :: 22
Database :: OpenAIRE
Journal :: ACM Transactions on Asian and Low-Resource Language Information Processing
Accession number :: edsair.doi...........072eedabac0f7215e8d6134e2fbf4329
Full Text :: https://doi.org/10.1145/3586009