Ronnie Alves, Anthony Boureux, Thérèse Commes, Sacha Beaumeunier, Jérôme Audoux, Florence Ruffle, Nicolas Philippe, Cellules Souches, Plasticité Cellulaire, Médecine Régénératrice et Immunothérapies (IRMB), Université de Montpellier (UM)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre Hospitalier Régional Universitaire [Montpellier] (CHRU Montpellier), Institut de Biologie Computationnelle (IBC), Université de Montpellier (UM)-Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Programa de Pós-Graduação em Ciências Contábeis [Belém, Brazil], Universidade Federal do Pará - UFPA [Belém, Brazil]-Instituto Tecnológico Vale [Belém, Brazil], Centre Hospitalier Régional Universitaire [Montpellier] (CHRU Montpellier)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Université de Montpellier (UM), Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Méthodes et Algorithmes pour la Bioinformatique (MAB), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Instituto Tecnológico Vale [Belém, Brazil] (ITV), and BMC, BMC
Background High-throughput sequencing technology and bioinformatics have identified chimeric RNAs (chRNAs), raising the possibility that chRNAs expressed specifically in disease can be used as biomarkers for both diagnosis and prognosis. Results The task of discriminating true chRNAs from false ones poses an interesting machine learning (ML) challenge. First, the sequencing data may contain false reads due to technical artifacts, and during analysis, bioinformatics tools may generate false positives due to methodological biases. Moreover, even if we succeed in obtaining a proper set of observations (enough sequencing data) about true chRNAs, chances are that the resulting model will not generalize beyond it. As in any other machine learning problem, the first big issue is finding good data to build models. As far as we know, there is no common benchmark data available for chRNA detection. A classification baseline is also lacking in the related literature. In this work we move towards benchmark data and an evaluation of the fidelity of supervised classifiers in the prediction of chRNAs. Conclusions We propose a modelling strategy that can be used to increase tool performance in chRNA classification, based on a simulated data generator that makes it possible to continuously integrate new complex chimeric events. The pipeline incorporates a genome mutation process and simulated RNA-seq data. Reads at distinct sequencing depths were aligned and analysed by CRAC, which integrates genomic location and local coverage, allowing biological predictions at the read scale. Additionally, these reads were functionally annotated and aggregated to form chRNA events, making it possible to evaluate the performance of ML methods (classifiers) at both the read and event levels.
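To make the two evaluation levels concrete, the following is a minimal sketch of how read-level classifier scores could be rolled up into event-level calls. The abstract does not state the aggregation rule, so the mean-score rule, the 0.5 threshold, the function name `aggregate_reads_to_events`, and the event identifiers are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict


def aggregate_reads_to_events(read_predictions, threshold=0.5):
    """Group read-level scores by event and label each event by its mean score.

    read_predictions: iterable of (event_id, score) pairs, with score in [0, 1].
    Returns {event_id: (mean_score, predicted_true)}.
    The mean-score rule and the threshold are assumptions for illustration.
    """
    scores_by_event = defaultdict(list)
    for event_id, score in read_predictions:
        scores_by_event[event_id].append(score)

    events = {}
    for event_id, scores in scores_by_event.items():
        mean_score = sum(scores) / len(scores)
        events[event_id] = (mean_score, mean_score >= threshold)
    return events


# Toy read-level scores for two hypothetical chimeric events.
reads = [
    ("geneA--geneB", 0.9),
    ("geneA--geneB", 0.7),
    ("geneC--geneD", 0.2),
    ("geneC--geneD", 0.4),
]
events = aggregate_reads_to_events(reads)
for event_id, (mean_score, predicted_true) in events.items():
    print(event_id, round(mean_score, 2), predicted_true)
```

Other aggregation rules (majority vote over reads, or the maximum read score) would fit the same interface; which one is appropriate depends on how supporting reads are distributed across events.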
Ensemble learning strategies proved more robust on this classification problem, achieving an average AUC of 95 % (ACC=94 %, Kappa=0.87). The resulting classification models were also tested on real RNA-seq data from a set of twenty-seven patients with acute myeloid leukemia (AML). Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0112-6) contains supplementary material, which is available to authorized users.
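For readers less familiar with the three reported metrics, here is a minimal sketch of how accuracy (ACC), Cohen's kappa, and AUC can be computed for a binary true/false chRNA classifier. Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e the agreement expected by chance; AUC is computed here as the fraction of (positive, negative) pairs ranked correctly. The labels and scores are toy values chosen for illustration, not data from the study, and the function names are hypothetical.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for binary labels coded as 0/1."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    p_o = (tp + tn) / n  # observed agreement, i.e. accuracy
    # Chance agreement: both say "true" plus both say "false".
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_o, kappa


def auc(y_true, scores):
    """Rank-based AUC: share of positive/negative pairs ordered correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = true chRNA, 0 = false chRNA.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc, kappa = accuracy_and_kappa(y_true, y_pred)
area = auc(y_true, scores)
print(f"ACC={acc:.2f} Kappa={kappa:.2f} AUC={area:.4f}")
```

Kappa is a coefficient on a -1 to 1 scale rather than a percentage, which is why it is reported without a percent sign while ACC and AUC are given as percentages.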