Katie Wilkins, Mehedi Hassan, Margherita Francescatto, Jakob Jespersen, R. Gonzalo Parra, Bart Cuypers, Dan DeBlasio, Alexander Junge, Anupama Jigisha, Farzana Rahman, Griet Laenen, Sander Willems, Lieven Thorrez, Yves Moreau, Nagarajan Raju, Sonia Pankaj Chothani, C. Ramakrishnan, Masakazu Sekijima, M. Michael Gromiha, Paddy J Slator, Nigel J Burroughs, Przemysław Szałaj, Zhonghui Tang, Paul Michalski, Oskar Luo, Xingwang Li, Yijun Ruan, Dariusz Plewczynski, Giulia Fiscon, Emanuel Weitschek, Massimo Ciccozzi, Paola Bertolazzi, Giovanni Felici, Pieter Meysman, Manu Vanaerschot, Maya Berg, Hideo Imamura, Jean-Claude Dujardin, Kris Laukens, Westa Domanova, James R. Krycer, Rima Chaudhuri, Pengyi Yang, Fatemeh Vafaee, Daniel J. Fazakerley, Sean J. Humphrey, David E. James, and Zdenka Kuncic
A novel feature selection method to extract multiple adjacent solutions for viral genomic sequences classification

Background
Leveraging improvements in next-generation sequencing technologies, genome sequencing of many samples under different conditions has led to an exponential growth of biological sequence data. However, these collections are not easy for biologists to process into a thorough characterization of the data, and doing so requires a high investment of cost and time. Therefore, computational strategies, and specifically automatic knowledge extraction methods that optimize the analysis by focusing on which data are meaningful and should be sequenced, are essential [1].

Methods
Here, we present a new feature-selection algorithm based on mixed integer programming methods [2] that is able to extract multiple and adjacent solutions for supervised learning problems applied to biological data. We focus on problems where the relative position of a feature (i.e., a nucleotide locus) is relevant. In particular, we aim to find sets of distinctive features that are as close as possible to each other and that appear with the same required characteristics. Our algorithm adopts a fast and effective method to evaluate the quality of the extracted sets of features, and it has been successfully integrated into a rule-based classification framework [3].

Results
Our algorithm has been applied to three viral datasets (i.e., Rhino-, Influenza-, and Polyomaviruses [4-6]) and makes it possible to extract all the alternative solutions for virus specimen-to-species assignment by identifying portions of sequence that are discriminant, compact, and as short as possible.
To conclude, we succeeded in extracting a wide set of equivalent classification rules that focus on short sequence regions, with high reliability and low computational time. This provides biologists with short and highly informative genome parts to be sequenced, as well as a powerful instrument for both scientific and diagnostic use, e.g., for automatic virus detection.
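
As an illustration of the goal stated in Methods (finding discriminating features that lie as close together as possible, and reporting every equivalent solution), the following is a minimal sketch in Python. It is not the mixed integer programming formulation used in this work: it swaps in a brute-force sliding-window search over aligned sequences, and the function names, the toy sequences, and the class labels are all hypothetical and chosen only for illustration.

```python
def window_separates(seqs, labels, start, width):
    """Return True if no two sequences from different classes share
    the same substring inside the window [start, start + width)."""
    seen = {}  # substring -> class label of the first sequence showing it
    for seq, label in zip(seqs, labels):
        key = seq[start:start + width]
        if seen.setdefault(key, label) != label:
            return False  # same substring occurs in two different classes
    return True


def shortest_discriminating_windows(seqs, labels):
    """Return all windows (start, end) of minimal width that separate
    the classes. Brute force: a stand-in for the MIP model, not the
    authors' algorithm."""
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    for width in range(1, n + 1):
        hits = [(start, start + width)
                for start in range(n - width + 1)
                if window_separates(seqs, labels, start, width)]
        if hits:
            return hits  # every equivalent solution at the minimal width
    return []


# Hypothetical toy alignment: two specimens per species.
toy_seqs = ["ACGTAC", "ACGTTC", "ACCTAG", "ACCTTG"]
toy_labels = ["species_1", "species_1", "species_2", "species_2"]
print(shortest_discriminating_windows(toy_seqs, toy_labels))
# -> [(2, 3), (5, 6)]
```

In this toy case two single-nucleotide loci each separate the two species, so both are returned as alternative solutions. A mixed integer programming model would be expected to scale far better than this exhaustive search and can also express sets of non-contiguous but nearby features, which this sketch does not attempt.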