An Efficient Text Classification Using fastText for Bahasa Indonesia Documents Classification
- Source :
- 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA).
- Publication Year :
- 2020
- Publisher :
- IEEE, 2020.
Abstract
- Text classification using a simple word representation with a linear classifier is often considered a strong baseline for achieving the best performance. However, a simple word representation such as Bag of Words (BOW) suffers from the curse of dimensionality, so it is only suitable for small datasets. BOW also requires language-dependent pre-processing steps such as stopword removal and stemming. Therefore, the BOW model cannot be applied automatically, because it depends on a specific language. On the other hand, deep neural network classifiers can eliminate the pre-processing prerequisite, but these models are not efficient in processing time and need a large dataset for the learning process. This becomes a challenge for languages with limited resources, such as Bahasa Indonesia. Another, newer approach is to use the fastText model for text classification. This model can minimize pre-processing dependencies and is more efficient in training time. However, there has not been much study of whether the fastText model outperforms the BOW model on small datasets. This paper compares text classification using the TF-IDF model, as one of the BOW models, with a fastText model on 500 news articles in Bahasa Indonesia. The results show that both models achieve outstanding performance, with an F-score of 0.97. The TF-IDF model needs longer pre-processing stages and requires more training time. Meanwhile, the fastText model only needs some hyperparameters to be tuned and obtains performance similar to that of the TF-IDF model. Based on this study, we conclude that the fastText model is an efficient text classifier.
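The dimensionality concern raised in the abstract can be illustrated with a minimal sketch of a TF-IDF representation in plain Python. This is not the authors' pipeline: the function name `tfidf_vectors`, the smoothed IDF formula, the whitespace tokenization, and the three example sentences are all assumptions made for illustration. The point it shows is that each document becomes a vector with one entry per distinct word in the corpus, so the vector length grows with the vocabulary.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using raw term frequency times a smoothed IDF.

    Hypothetical helper for illustration; no stopword removal or stemming is done.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant): rarer tokens get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # One entry per vocabulary word; absent words contribute 0.
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


# Invented example sentences, not taken from the paper's dataset.
docs = [
    "harga saham naik hari ini",
    "tim nasional menang hari ini",
    "harga minyak dunia turun",
]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), len(vectors[0]))
```

Every vector has the same length as the vocabulary, which is why a BOW representation becomes large and sparse as the corpus grows, and why pre-processing that shrinks the vocabulary matters for it.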
- Subjects :
- 0301 basic medicine
Artificial neural network
business.industry
Computer science
05 social sciences
Linear classifier
Semantics
Machine learning
computer.software_genre
050105 experimental psychology
03 medical and health sciences
Statistical classification
030104 developmental biology
Classifier (linguistics)
0501 psychology and cognitive sciences
Artificial intelligence
business
tf–idf
computer
Word (computer architecture)
Curse of dimensionality
Details
- Database :
- OpenAIRE
- Journal :
- 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA)
- Accession number :
- edsair.doi...........7e80645f474cf2cb338292e7961959f5
- Full Text :
- https://doi.org/10.1109/databia50434.2020.9190447