Urdu Short Paraphrase Detection at Sentence Level
- Source :
- ACM Transactions on Asian and Low-Resource Language Information Processing. 22:1-20
- Publication Year :
- 2023
- Publisher :
- Association for Computing Machinery (ACM), 2023.
Abstract
- Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previous research has mainly focused on developing paraphrase detection resources for the English language, and there have been very few efforts for South Asian languages. In particular, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language, mainly because no sentence-level corpora are available. The existing related studies on Urdu focus only on text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. An N-gram approach serves as the baseline technique. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, performance increases when the features of the proposed ST and baseline N-gram techniques are combined for the classification task. In addition, the proposed techniques have been applied to the UPPC corpus to check their performance at the document level. We obtained the best result using the feature-fusion technique (F1 = 0.855). Our corpus is available and free to download for research purposes.
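As a rough illustration of the N-gram baseline and the feature-fusion idea mentioned in the abstract, the sketch below computes word n-gram overlap scores for a sentence pair and concatenates them with additional features. It is not the authors' implementation: the function names, the whitespace tokenization, the use of Jaccard overlap, and fusion by simple concatenation are all assumptions made for the example.

```python
from typing import List, Set, Tuple


def word_ngrams(sentence: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ngram_features(s1: str, s2: str, max_n: int = 3) -> List[float]:
    """One Jaccard score per n-gram length from 1 to max_n."""
    return [
        jaccard(word_ngrams(s1, n), word_ngrams(s2, n))
        for n in range(1, max_n + 1)
    ]


def fuse(ngram_feats: List[float], other_feats: List[float]) -> List[float]:
    """Feature fusion by concatenation (an assumed, commonly used choice)."""
    return ngram_feats + other_feats


# Example pair; the second feature list stands in for a hypothetical
# sentence-embedding similarity score computed elsewhere.
features = fuse(ngram_features("a b c d", "a b c e"), [0.9])
```

The fused vector would then be passed to a classifier that labels the pair as paraphrased or non-paraphrased. The value `0.9` is a placeholder for an embedding-based similarity, not a measured result.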
- Subjects :
- General Computer Science
Details
- ISSN :
- 2375-4702 and 2375-4699
- Volume :
- 22
- Database :
- OpenAIRE
- Journal :
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Accession number :
- edsair.doi...........072eedabac0f7215e8d6134e2fbf4329
- Full Text :
- https://doi.org/10.1145/3586009