Experiment on Methods for Clustering and Categorization of Polish Text
- Source :
- BASE-Bielefeld Academic Search Engine, COMPUTING AND INFORMATICS; Vol. 36 No. 1 (2017): Computing and Informatics; 186-204
- Publication Year :
- 2017
- Publisher :
- Central Library of the Slovak Academy of Sciences, 2017.
Abstract
- The main goal of this work was to experimentally verify methods for the challenging task of categorizing and clustering Polish text. Supervised learning was employed for categorization and unsupervised learning for clustering. The methods were examined in depth on a custom-built corpus of Polish texts, which the authors assembled from Internet resources. The corpus data was acquired from a news portal and was therefore already sorted by type by journalists according to their specialization. The presented algorithms employ the Vector Space Model (VSM) and the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. A series of experiments revealed certain properties of the algorithms and their accuracy. Accuracy was evaluated in terms of the algorithms' ability to match the human arrangement of the documents by topic. For both categorization and clustering, the authors used the F-measure to assess the quality of allocation.
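For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of TF-IDF weighting in a vector space model, cosine similarity between document vectors, and a per-class F-measure. It is not the authors' implementation: the function names, the toy tokens, the particular TF-IDF variant (relative term frequency times the natural log of inverse document frequency), and the per-class F1 definition are all assumptions chosen for illustration, and the paper may use different variants.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dict: term -> weight).

    Assumed variant: tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def f_measure(predicted, actual, label):
    """Balanced F-measure (F1) for one class: harmonic mean of precision and recall."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus: two sports-like documents and one politics-like document.
docs = [
    ["sport", "mecz", "gol"],
    ["sport", "liga", "gol"],
    ["rząd", "sejm", "ustawa"],
]
vectors = tf_idf_vectors(docs)
same_topic = cosine(vectors[0], vectors[1])   # documents share "sport" and "gol"
cross_topic = cosine(vectors[0], vectors[2])  # documents share no terms

# Hypothetical labels: a classifier's output versus the journalists' categories.
score = f_measure(
    predicted=["sport", "sport", "politics"],
    actual=["sport", "politics", "politics"],
    label="sport",
)
```

In this sketch, documents on the same topic share weighted terms and so have a positive cosine similarity, while documents with no terms in common score zero. The F-measure then compares an automatic assignment against the human-assigned categories, which is the kind of evaluation the abstract describes.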
- Subjects :
- Scheme (programming language)
Computer Networks and Communications
Computer science
Conceptual clustering
Machine learning
Polish text
Task (project management)
Cluster analysis
tf–idf
VSM
TF-IDF
categorization
ComputingMethodologies_PATTERNRECOGNITION
Computational Theory and Mathematics
Categorization
Hardware and Architecture
Vector space model
Unsupervised learning
Artificial intelligence
Software
Natural language processing
clustering
Details
- ISSN :
- 1335-9150 and 2585-8807
- Volume :
- 36
- Database :
- OpenAIRE
- Journal :
- Computing and Informatics
- Accession number :
- edsair.doi.dedup.....4d5f4297452199e549f0820cb27e5d15