201. On Continent and Script-Wise Divisions-Based Statistical Measures for Stop-words Lists of International Languages
- Author
- Rajnish M. Rakholia and Jatinderkumar R. Saini
- Subjects
- Kanji; Arabic; Computer science; 020209 energy; 02 engineering and technology; Natural Language Processing (NLP); computer.software_genre; Script; 0202 electrical engineering, electronic engineering, information engineering; General Environmental Science; Hangul; Language; East Asian languages; Stop words; Armenian; business.industry; Kana; Function-Words; language.human_language; Bengali; Devanagari; Stop-Words; language; General Earth and Planetary Sciences; 020201 artificial intelligence & image processing; Artificial intelligence; Marathi; business; computer; Natural language processing
- Abstract
- The data for this research were collected for 42 international languages spanning 3 continents, viz. Asia, Europe and South America. The data comprised a unigram-model representation of the lexicons in the stop-words lists. The study considered 13 scripting systems, comprising Arabic, Armenian, Bengali, Chinese, Cyrillic, Devanagari, Greek, Gurmukhi, Hanja & Hangul, Kana, Kanji, Marathi, Roman (Latin) and Thai. Based on a comprehensive analysis of statistical measures for the stop-words lists, it is concluded that Asian languages are mostly self-scripted and that the average number of stop-words in Asian languages is higher than that in European languages. Among other important and first-of-their-kind results, a key inference of this work is that the average number of stop-words for any given language can be predicted to be 200.
- Published
- 2016