A New Method to Cluster HTML Documents Using Mixed Algorithms
- Source :
- مطالعات مدیریت کسب و کار هوشمند, Vol 6, Iss 24, Pp 37-62 (2018)
- Publication Year :
- 2018
- Publisher :
- Allameh Tabataba'i University Press, 2018.
Abstract
- Given the high volume of information on the web, automatic data extraction systems have received growing attention. Clustering is one of the most important data extraction methods. Many clustering methods are available today, and most are based on vector models. In these models, each document is treated as a set of words, and the order of words within a sentence is ignored. Because meaning in natural language depends on the order of words, these methods have significant shortcomings. To overcome them, this paper presents a new method for clustering HTML documents that applies the STC algorithm, which clusters snippets. The method, called key-sentence-based clustering (KS_STC), builds a weighted vector for each document and uses that vector to extract the document's key sentences. Finally, these key sentences are passed to the STC algorithm for clustering.
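The pipeline described in the abstract (weighted vector per document, key-sentence extraction, phrase-based clustering of those sentences) can be illustrated with a rough sketch. This is not the paper's implementation: the abstract does not specify the weighting scheme, so smoothed TF-IDF is assumed here, and the final step is a simplified stand-in that groups documents by shared word n-grams rather than building an actual suffix tree. All function names (`tfidf_weights`, `key_sentences`, `shared_phrase_clusters`) and parameters (`k`, `max_len`, `min_docs`) are hypothetical, and the input is assumed to be plain text already stripped of HTML tags.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_weights(docs):
    """Return {doc_id: {word: weight}} using smoothed TF-IDF (an assumed scheme)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    weights = {}
    for doc_id, toks in tokens.items():
        term_freq = Counter(toks)
        weights[doc_id] = {
            w: tf * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w, tf in term_freq.items()
        }
    return weights


def key_sentences(text, word_weights, k=2):
    """Pick the k sentences with the highest mean word weight, in original order."""
    scored = []
    for pos, sentence in enumerate(split_sentences(text)):
        toks = tokenize(sentence)
        score = sum(word_weights.get(t, 0.0) for t in toks) / len(toks) if toks else 0.0
        scored.append((score, pos, sentence))
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


def shared_phrase_clusters(docs, k=2, max_len=3, min_docs=2):
    """Group documents by word n-grams shared across their key sentences.

    Simplified stand-in for the base-cluster step of suffix-tree-based
    clustering: it keeps word order inside phrases but enumerates n-grams
    directly instead of building a suffix tree.
    """
    weights = tfidf_weights(docs)
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in key_sentences(text, weights[doc_id], k):
            toks = tokenize(sentence)
            for n in range(2, max_len + 1):
                for i in range(len(toks) - n + 1):
                    phrase_docs[tuple(toks[i:i + n])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}
```

Calling `shared_phrase_clusters` on a dictionary that maps document ids to text returns a mapping from each shared phrase (a tuple of words) to the set of documents whose key sentences contain it. A fuller version would also merge overlapping groups, which this sketch leaves out.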
Details
- Language :
- Persian
- ISSN :
- 2821-0964 and 2821-0816
- Volume :
- 6
- Issue :
- 24
- Database :
- Directory of Open Access Journals
- Journal :
- مطالعات مدیریت کسب و کار هوشمند
- Publication Type :
- Academic Journal
- Accession number :
- edsdoj.34a0892ad31d4e798e160998b9e36c72
- Document Type :
- article
- Full Text :
- https://doi.org/10.22054/ims.2018.8891