1. Improving binary classification of web pages using an ensemble of feature selection algorithms
- Author
- Matteo Lombardi, Vladimir Estivill-Castro, and Alessandro Marani
- Subjects
Binary classification; Computer science; 020204 information systems; Web page; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Feature selection; 02 engineering and technology; Feature set; Algorithm; Merge (version control)
- Abstract
A well-known method to produce accurate predictive models is to apply algorithms for feature selection and feature reduction. These algorithms describe an item with a subset of its attributes that is expected to be as small as possible without compromising the representation of the object and, consequently, the classification as a whole. However, different feature-selection algorithms have different, potentially complementary properties, each capturing only some aspects of the feature set. Hence, the resulting subset of attributes may vary significantly from one feature-selection approach to another, and each method affects classification accuracy differently. In this contribution, we combine feature-selection algorithms with the intention of recognising the purpose of a web page. That is, we propose a framework for building an ensemble of feature-selection algorithms that merges their outcomes into a single score and thus achieves a comprehensive analysis of the feature set. We evaluated our proposal against traditional feature-selection and feature-reduction algorithms in a binary classification task on web pages. Our dataset consists of more than 400 pages labelled by educators as either relevant or not relevant for teaching purposes. We also evaluate the impact of the combination across several classifiers. Our results show that our framework outperforms current algorithms, allowing for a much faster yet reliable classification of web pages in all the scenarios tested. We expect that our findings will contribute to improving the performance of web classifiers, particularly when running on the fly and in real time.
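The idea of merging several feature-selection outcomes into a single score can be illustrated with a minimal sketch. This is not the authors' framework: the two scorers (`abs_correlation`, `class_mean_gap`), the min-max normalisation, and the unweighted averaging are all assumptions chosen for illustration, and the toy data is invented. The abstract does not state which algorithms are combined or how their outcomes are merged.

```python
from statistics import mean, pstdev


def abs_correlation(column, labels):
    """Absolute Pearson correlation between one feature column and 0/1 labels."""
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    mx, my = mean(column), mean(labels)
    cov = mean([(x - mx) * (y - my) for x, y in zip(column, labels)])
    return abs(cov / (sx * sy))


def class_mean_gap(column, labels):
    """Absolute difference between the feature's mean in each class."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    return abs(mean(pos) - mean(neg))


def min_max(scores):
    """Rescale scores to [0, 1] so different scorers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(rows, labels, scorers):
    """Average the normalised scores of every scorer (assumed merge rule)."""
    columns = list(zip(*rows))
    per_scorer = [
        min_max([scorer(col, labels) for col in columns]) for scorer in scorers
    ]
    return [mean(values) for values in zip(*per_scorer)]


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Invented toy data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is unrelated noise.
rows = [
    (5.0, 1.0, 0.2),
    (4.0, 1.0, 0.9),
    (1.0, 1.0, 0.3),
    (0.0, 1.0, 0.8),
]
labels = [1, 1, 0, 0]

scores = ensemble_scores(rows, labels, [abs_correlation, class_mean_gap])
selected = select_top_k(scores, 1)
```

Normalising before averaging keeps a scorer with a large numeric range from dominating the merged score; a weighted average or a rank-based merge would be equally plausible alternatives.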
- Published
- 2018