Building the Pornography Corpus for Bahasa Indonesia Based on TRUST+™ Positif Database

Authors :
Ainul Hizriadi
Dani Gunawan
Syaiful Anwar Husen Lubis
Romi Fadillah Rahmat
Source :
2019 International Conference on ICT for Smart Society (ICISS).
Publication Year :
2019
Publisher :
IEEE, 2019.

Abstract

The Indonesian government has developed a database called TRUST+™ Positif, which contains a list of blacklisted URLs. These URLs are blacklisted because they contain prohibited material or negative content such as pornography, radicalism, fraud, racism, violence, gambling, and security threats. The government requires all Internet Service Providers (ISPs) in Indonesia to block access to websites listed in the TRUST+™ Positif database. The government expects this action to help reduce the spread of negative content, especially pornography. One of the many ways to disseminate pornographic content is by publishing articles on websites. The purpose of this research is to provide a pornographic corpus that can serve as a reference for pornography identification research. The corpus is built from 1,000 articles taken from 150 websites listed in the TRUST+™ Positif database. This research extracts only the sentences that are related to pornography. The extraction is based on selected keywords. These keywords are drawn from the most frequent words in the articles and must be related to pornography. In total, 447 keywords were selected manually. The result of this research is a pornographic corpus in Bahasa Indonesia that consists of 35,753 sentences.
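
The abstract outlines a two-step pipeline: rank words by frequency to obtain keyword candidates, then keep only the sentences that contain a selected keyword. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the regex-based tokenizer and sentence splitter, and the neutral placeholder keywords (`alpha`, `beta`) are all assumptions made for illustration; in the paper the final keyword list was chosen manually from the frequent words.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens are enough for this sketch.
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def candidate_keywords(articles, top_n=10):
    # Count word frequencies across all articles and return the most frequent
    # words as candidates; a human reviewer would then filter this list.
    counts = Counter()
    for article in articles:
        counts.update(tokenize(article))
    return [word for word, _ in counts.most_common(top_n)]


def extract_sentences(articles, keywords):
    # Keep every sentence that contains at least one selected keyword.
    keyset = set(keywords)
    corpus = []
    for article in articles:
        for sentence in split_sentences(article):
            if keyset.intersection(tokenize(sentence)):
                corpus.append(sentence)
    return corpus


if __name__ == "__main__":
    # Placeholder articles and keywords, purely hypothetical.
    articles = [
        "Kata alpha muncul di sini. Kalimat ini netral.",
        "Alpha dan beta muncul lagi! Tidak ada apa pun.",
    ]
    print(candidate_keywords(articles, top_n=2))
    print(extract_sentences(articles, ["alpha", "beta"]))
```

Matching on whole tokens rather than substrings avoids picking up sentences where a keyword merely appears inside a longer, unrelated word; whether the original work matched this way is not stated in the abstract.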

Details

Database :
OpenAIRE
Journal :
2019 International Conference on ICT for Smart Society (ICISS)
Accession number :
edsair.doi...........0a19f6b395c915c62fe9cfbfe823e5eb
Full Text :
https://doi.org/10.1109/iciss48059.2019.8969831