1. Machine learning methods for transcription data integration
- Author
- Charles DeLisi, Mark A. Kon, and Dustin T. Holloway
- Subjects
Biological data, General Computer Science, Computer science, Machine learning, Weighting, Support vector machine, Statistical classification, Statistical learning theory, Subsequence, Coding region, Artificial intelligence, Gene
- Abstract
Gene expression is modulated by transcription factors (TFs), which are proteins that generally bind to DNA adjacent to coding regions and initiate transcription. Each target gene can be regulated by more than one TF, and each TF can regulate many targets. For a complete molecular understanding of transcriptional regulation, researchers must first associate each TF with the set of genes that it regulates. Here we summarize completed work on associating 104 TFs with their binding sites using support vector machines (SVMs), which are classification algorithms grounded in statistical learning theory. We use several types of genomic datasets to train classifiers that predict TF binding in the yeast genome: motif matches, subsequence counts, motif conservation, functional annotation, and expression profiles. A simple weighting scheme varies the contribution of each type of genomic data when building a final SVM classifier, which we evaluate using known binding sites published in the literature and in online databases. The SVM algorithm works best when all datasets are combined, producing 73% coverage of known interactions, with a prediction accuracy of almost 0.9. We discuss new ideas and preliminary work for improving SVM classification of biological data.
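
The abstract does not spell out how the weighting scheme is implemented. One common way to let several data types contribute unequally to a single SVM is to compute one kernel (gene-by-gene similarity matrix) per data type and take a weighted sum. The sketch below illustrates that idea only; it is an assumption, not necessarily the authors' method. The function names, the toy feature values, and the weights are all hypothetical.

```python
# Minimal sketch: weighted combination of per-data-type kernels.
# ASSUMPTION: a weighted kernel sum stands in for the "simple weighting
# scheme" mentioned in the abstract; the paper may do something different.

def linear_kernel(features):
    """Gram matrix K[i][j] = <x_i, x_j> for one data type (one row per gene)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in features]
            for xi in features]

def normalize_kernel(K):
    """Rescale so each gene has self-similarity 1, making data types comparable."""
    n = len(K)
    d = [(K[i][i] ** 0.5) or 1.0 for i in range(n)]  # guard against zero rows
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights):
    """Weighted sum of kernels; weights are rescaled to sum to 1."""
    total = sum(weights[name] for name in kernels)
    n = len(next(iter(kernels.values())))
    return [[sum(weights[name] / total * K[i][j] for name, K in kernels.items())
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: three genes described by two data types.
motif_matches = [[1, 0], [1, 1], [0, 2]]
expression = [[0.5, -0.5], [0.4, -0.6], [-1.0, 1.0]]

kernels = {
    "motif": normalize_kernel(linear_kernel(motif_matches)),
    "expression": normalize_kernel(linear_kernel(expression)),
}
weights = {"motif": 2.0, "expression": 1.0}  # hypothetical relative weights

combined = combine_kernels(kernels, weights)
```

A combined matrix of this kind can be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Under this assumed scheme, the weights would be chosen by cross-validation against known TF-target pairs.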
- Published
- 2006