1. A data sampling and attribute selection strategy for improving decision tree construction
- Author
- Didier Rémond, Wajdi Dhifli, Nour El Islem Karabadji, Hassina Seridi, Sabeur Aridhi, and Ilyes Khelf
- Affiliations: Laboratoire de Gestion Electronique de Document [Annaba] (LabGED); Université Badji Mokhtar Annaba (UBMA); Ecole Supérieure de Technologies Industrielles Annaba; Laboratoire de Mécanique des Contacts et des Structures [Villeurbanne] (LaMCoS); Institut National des Sciences Appliquées de Lyon (INSA Lyon); Université de Lyon; Institut National des Sciences Appliquées (INSA); Centre National de la Recherche Scientifique (CNRS); Laboratoire Vibrations Acoustique (LVA); Computational Algorithms for Protein Structures and Interactions (CAPSID); Inria Nancy - Grand Est; Institut National de Recherche en Informatique et en Automatique (Inria); Department of Complex Systems, Artificial Intelligence & Robotics (LORIA - AIS); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Université de Lorraine (UL); Université de Lille
- Subjects
0209 industrial biotechnology; Computer science; Decision tree; Feature selection; 02 engineering and technology; Variation (game tree); Residual; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; 020901 industrial engineering & automation; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; Artificial Intelligence; 0202 electrical engineering, electronic engineering, information engineering; Sampling; Fault diagnosis; [INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB]; Particle swarm optimization; General Engineering; Condition monitoring; Sampling (statistics); Attribute selection; Computer Science Applications; Instantaneous angular speed; 020201 artificial intelligence & image processing; Noise (video); Data mining; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
- Abstract
Decision trees are an efficient means of building classification models because of the compressibility, simplicity, and ease of interpretation of their results. However, the construction phase often produces large trees that are affected by uncertainties in the data (particularity, noise, and residual variation). Combining attribute selection and data sampling is one of the most promising research directions for overcoming these construction problems, but the search space of all possible combinations of training-sample subsets and attribute subsets is extremely large. This paper presents a novel approach that generates an optimized decision tree by selecting an optimal pair of training-sample and attribute subsets for training. Because the search space of candidate pairs is extremely large, we use particle swarm optimization to make the search for an “optimal” solution tractable. The selected solution helps avoid the over-fitting and complexity problems that arise during decision tree construction. We conducted an extensive experimental evaluation on 22 datasets from the UCI Machine Learning Repository. The results show that the proposed approach outperforms state-of-the-art classical and evolutionary decision tree construction methods in terms of simplicity, accuracy, and F-measure. We further evaluate the approach on a real-world engineering application: condition monitoring of rotating machinery under severe non-stationary conditions. There, the proposed approach made it possible to optimize the use of instantaneous angular speed for diagnosing gear defects.
- Published
- 2019