Author: "Morihiro Hayashida" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Morihiro Hayashida"' showing total 131 results

1. Improving prediction of heterodimeric protein complexes using combination with pairwise kernel

Author: Peiying Ruan, Morihiro Hayashida, Tatsuya Akutsu, and Jean-Philippe Vert
Subjects: Heterodimeric protein complex, Combination kernel, Pairwise kernel, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Background: Since many proteins become functional only after they interact with their partner proteins and form protein complexes, it is essential to identify the sets of proteins that form complexes. Therefore, several computational methods have been proposed to predict complexes from the topology and structure of experimental protein-protein interaction (PPI) networks. These methods work well to predict complexes involving at least three proteins, but generally fail at identifying complexes involving only two different proteins, called heterodimeric complexes or heterodimers. There is, however, an urgent need for efficient methods to predict heterodimers, since the majority of known protein complexes are precisely heterodimers. Results: In this paper, we use three promising kernel functions: the Min kernel and two pairwise kernels, the Metric Learning Pairwise Kernel (MLPK) and the Tensor Product Pairwise Kernel (TPPK). We also consider the normalized forms of the Min kernel. Then, we combine the Min kernel or its normalized form with one of the pairwise kernels by plugging. We apply kernels based on PPI, domain, phylogenetic profile, and subcellular localization properties to predicting heterodimers. Then, we evaluate our method by employing C-Support Vector Classification (C-SVC), carrying out 10-fold cross-validation, and calculating the average F-measures. The results suggest that the combination of the normalized Min kernel and MLPK leads to the best F-measure and improves on the performance of our previous work, which had been the best existing method so far. Conclusions: We propose new methods to predict heterodimers using a machine learning-based approach. We train a support vector machine (SVM) to discriminate interacting vs non-interacting protein pairs, based on information extracted from PPI, domain, phylogenetic profiles and subcellular localization. We evaluate in detail new kernel functions to encode these data, and report prediction performance that outperforms the state of the art.
Published: 2018
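As a rough illustration of the kernels named in the abstract of record 1, the sketch below uses the commonly cited definitions of the Min kernel, a cosine-style normalization, TPPK, and MLPK. The function names, the toy feature vectors, and the idea of passing a base kernel into the pairwise kernels are assumptions made for illustration; the abstract does not spell out how the "plugging" is done, so this is not the authors' implementation.

```python
import math

def min_kernel(x, y):
    """Min (histogram intersection) kernel on non-negative feature vectors."""
    return sum(min(a, b) for a, b in zip(x, y))

def normalized_min_kernel(x, y):
    """Min kernel scaled so that k(x, x) == 1 for any non-zero x (assumed normalization)."""
    denom = math.sqrt(min_kernel(x, x) * min_kernel(y, y))
    return min_kernel(x, y) / denom if denom > 0 else 0.0

def tppk(k, a, b, c, d):
    """Tensor Product Pairwise Kernel between the pairs (a, b) and (c, d)."""
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)

def mlpk(k, a, b, c, d):
    """Metric Learning Pairwise Kernel between the pairs (a, b) and (c, d)."""
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2

# Hypothetical per-protein feature vectors (toy values, not real data).
p1, p2, p3, p4 = [1, 0, 2], [0, 1, 1], [1, 1, 0], [2, 0, 1]
print(tppk(normalized_min_kernel, p1, p2, p3, p4))
print(mlpk(normalized_min_kernel, p1, p2, p3, p4))
```

Both pairwise kernels are symmetric in the order of the two proteins within a pair, which is what makes them suitable for unordered protein pairs.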

2. Determining the minimum number of protein-protein interactions required to support known protein complexes.

Author: Natsu Nakajima, Morihiro Hayashida, Jesper Jansson, Osamu Maruyama, and Tatsuya Akutsu
Subjects: Medicine, Science
Abstract: The prediction of protein complexes from protein-protein interactions (PPIs) is a well-studied problem in bioinformatics. However, the currently available PPI data is not enough to describe all known protein complexes. In this paper, we express the problem of determining the minimum number of (additional) required protein-protein interactions as a graph theoretic problem under the constraint that each complex constitutes a connected component in a PPI network. For this problem, we develop two computational methods: one is based on integer linear programming (ILPMinPPI) and the other one is based on an existing greedy-type approximation algorithm (GreedyMinPPI) originally developed in the context of communication and social networks. Since the former method is only applicable to datasets of small size, we apply the latter method to a combination of the CYC2008 protein complex dataset and each of eight PPI datasets (STRING, MINT, BioGRID, IntAct, DIP, BIND, WI-PHI, iRefIndex). The results show that the minimum number of additional required PPIs ranges from 51 (STRING) to 964 (BIND), and that even the four best PPI databases, STRING (51), BioGRID (67), WI-PHI (93) and iRefIndex (85), do not include enough PPIs to form all CYC2008 protein complexes. We also demonstrate that the proposed problem framework and our solutions can enhance the prediction accuracy of existing PPI prediction methods. ILPMinPPI can be freely downloaded from http://sunflower.kuicr.kyoto-u.ac.jp/~nakajima/.
Published: 2018
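For a single complex considered in isolation, the number of PPIs that must be added to make its proteins connected is the number of connected components of the induced subgraph minus one. The sketch below computes only that per-complex count with a small union-find; it is not ILPMinPPI or GreedyMinPPI, which have to handle many overlapping complexes at once, and the function name and toy data are hypothetical.

```python
def missing_edges_for_complex(complex_proteins, ppi_edges):
    """Count edges needed to connect one complex, ignoring all other complexes."""
    members = set(complex_proteins)
    parent = {p: p for p in members}

    def find(p):
        # Find the component representative, compressing the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Only edges with both endpoints inside the complex are relevant.
    for u, v in ppi_edges:
        if u in members and v in members:
            parent[find(u)] = find(v)

    components = {find(p) for p in members}
    return max(len(components) - 1, 0)

# Toy example: {A, B}, {C, D}, and {E} are three components, so two edges are missing.
edges = [("A", "B"), ("C", "D"), ("A", "X")]
print(missing_edges_for_complex(["A", "B", "C", "D", "E"], edges))
```

Summing this count over all complexes generally overestimates the joint minimum, because one added interaction can help connect several overlapping complexes.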

3. Prediction of Protein-Protein Interaction Strength Using Domain Features with Supervised Regression

Author: Mayumi Kamada, Yusuke Sakuma, Morihiro Hayashida, and Tatsuya Akutsu
Subjects: Technology, Medicine, Science
Abstract: Proteins in living organisms express various important functions by interacting with other proteins and molecules. Therefore, many efforts have been made to investigate and predict protein-protein interactions (PPIs). Analysis of the strengths of PPIs is also important because such strengths are involved in the functionality of proteins. In this paper, we propose several feature space mappings from protein pairs using protein domain information to predict the strengths of PPIs. Moreover, we perform computational experiments employing two machine learning methods, support vector regression (SVR) and relevance vector machine (RVM), on a dataset obtained from biological experiments. The prediction results showed that both SVR and RVM with our proposed features outperformed the best existing method.
Published: 2014

4. Survival Analysis by Penalized Regression and Matrix Factorization

Author: Yeuntyng Lai, Morihiro Hayashida, and Tatsuya Akutsu
Subjects: Technology, Medicine, Science
Abstract: Because every disease has its unique survival pattern, it is necessary to find a suitable model to simulate follow-ups. The DNA microarray is a useful technique to detect thousands of gene expressions at one time and is usually employed to classify different types of cancer. We propose methods that combine penalized regression models with nonnegative matrix factorization (NMF) for predicting survival. We tried L1- (lasso), L2- (ridge), and L1-L2 combined (elastic net) penalized regression on microarray data from diffuse large B-cell lymphoma (DLBCL) patients and found that the L1-L2 combined method predicts survival best, with the smallest logrank P value. Furthermore, 80% of the selected genes have been reported to correlate with carcinogenesis or lymphoma. Through NMF we found that DLBCL patients can be clearly divided into 4 groups, which implies that DLBCL may have 4 subtypes with slightly different survival patterns. Next, we excluded some patients whom NMF indicated were hard to classify and ran the three penalized regression models again. We found that the performance of survival prediction improved, with lower logrank P values. Therefore, we conclude that after preselection of patients by NMF, penalized regression models can predict DLBCL patients' survival successfully.
Published: 2013

5. Prediction of heterodimeric protein complexes from weighted protein-protein interaction networks using novel features and kernel functions.

Author: Peiying Ruan, Morihiro Hayashida, Osamu Maruyama, and Tatsuya Akutsu
Subjects: Medicine, Science
Abstract: Since many proteins express their functional activity by interacting with other proteins and forming protein complexes, it is very useful to identify sets of proteins that form complexes. For that purpose, many methods for predicting protein complexes from protein-protein interactions have been developed, such as MCL, MCODE, RNSC, PCP, RRW, and NWE. These methods have dealt only with complexes of size greater than three because they are often based on some measure of subgraph density. However, heterodimeric protein complexes, which consist of two distinct proteins, account for a large proportion of complexes according to several comprehensive databases of known complexes. In this paper, we propose several feature space mappings from protein-protein interaction data, in which each interaction is weighted based on reliability. Furthermore, we make use of prior knowledge on protein domains to develop feature space mappings, a domain composition kernel, and its combination kernel with our proposed features. We perform ten-fold cross-validation computational experiments. The results suggest that our proposed kernel considerably outperforms the naive Bayes-based method, which is the best existing method for predicting heterodimeric protein complexes.
Published: 2013

6. Measuring the Similarity of Proteomes using Grammar-based Compression via Domain Combinations.

Author: Morihiro Hayashida, Hitoshi Koyano, and Jose C. Nacher
Published: 2020

7. Improving Accuracy and Speed of Network-based Intrusion Detection using Gradient Boosting Trees.

Author: Ryosuke Terado and Morihiro Hayashida
Published: 2020

8. Artificial Neural Network Approach to Prediction of Protein-RNA Residue-base Contacts.

Author: Morihiro Hayashida, Jose C. Nacher, and Hitoshi Koyano
Published: 2019

9. Grammar-based Compression for Directed and Undirected Generalized Series-parallel Graphs using Integer Linear Programming.

Author: Morihiro Hayashida, Hitoshi Koyano, and Tatsuya Akutsu
Published: 2018

10. Convolutional Neural Network Approach to Lung Cancer Classification Integrating Protein Interaction Network and Gene Expression Profiles.

Author: Teppei Matsubara, Tomoshiro Ochiai, Morihiro Hayashida, Tatsuya Akutsu, and Jose C. Nacher
Published: 2018

11. Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings.

Author: Morihiro Hayashida and Hitoshi Koyano
Published: 2016

12. Finding Median and Center Strings for a Probability Distribution on a Set of Strings Under Levenshtein Distance Based on Integer Linear Programming.

Author: Morihiro Hayashida and Hitoshi Koyano
Published: 2016

13. Host-Pathogen Protein Interaction Prediction Based on Local Topology Structures of a Protein Interaction Network.

Author: Jira Jindalertudomdee, Morihiro Hayashida, Jiangning Song, and Tatsuya Akutsu
Published: 2016

14. Predicting protein-RNA residue-base contacts using two-dimensional conditional random field.

Author: Morihiro Hayashida, Mayumi Kamada, Jiangning Song, and Tatsuya Akutsu
Published: 2012

15. Finding Conserved Regions in Protein Structures Using Support Vector Machines and Structure Alignment.

Author: Tatsuya Akutsu, Morihiro Hayashida, and Takeyuki Tamura
Published: 2012

16. A Quadsection Algorithm for Grammar-Based Image Compression.

Author: Morihiro Hayashida, Peiying Ruan, and Tatsuya Akutsu
Published: 2010

17. A Bipartite Graph Based Model of Protein Domain Networks.

Author: Jose C. Nacher, Tomoshiro Ochiai, Morihiro Hayashida, and Tatsuya Akutsu
Published: 2009

18. Integer programming-based methods for attractor detection and control of boolean networks.

Author: Tatsuya Akutsu, Morihiro Hayashida, and Takeyuki Tamura
Published: 2009

19. Image Compression-based Approach to Measuring the Similarity of Protein Structures.

Author: Morihiro Hayashida and Tatsuya Akutsu
Published: 2008

20. Algorithms for Inference, Analysis and Control of Boolean Networks.

Author: Tatsuya Akutsu, Morihiro Hayashida, and Takeyuki Tamura
Published: 2008

21. A Novel Clustering Method for Analysis of Biological Networks using Maximal Components of Graphs.

Author: Morihiro Hayashida, Tatsuya Akutsu, and Hiroshi Nagamochi
Published: 2007

22. Topological aspects of protein networks.

Author: Jose C. Nacher, Morihiro Hayashida, and Tatsuya Akutsu
Published: 2007

23. On the Complexity of Finding Control Strategies for Boolean Networks.

Author: Tatsuya Akutsu, Morihiro Hayashida, Wai-Ki Ching, and Michael K. Ng
Published: 2006

24. Protein Threading with Profiles and Constraints.

Author: Tatsuya Akutsu, Morihiro Hayashida, Etsuji Tomita, Jun'ichi Suzuki, and Katsuhisa Horimoto
Published: 2004

25. Inferring strengths of protein-protein interactions from experimental data using linear programming.

Author: Morihiro Hayashida, Nobuhisa Ueda, and Tatsuya Akutsu
Published: 2003

26. Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings

Author: Hitoshi Koyano, Morihiro Hayashida, and Tatsuya Akutsu
Subjects: Computer Networks and Communications, Parametric Probability Distribution, Applied Mathematics, Posterior probability, Mathematics - Statistics Theory, Statistics Theory (math.ST), 0102 computer and information sciences, 02 engineering and technology, Mixture model, 01 natural sciences, String (physics), Theoretical Computer Science, High Energy Physics::Theory, ComputingMethodologies_PATTERNRECOGNITION, Asymptotically optimal algorithm, Computational Theory and Mathematics, Probability theory, 010201 computation theory & mathematics, 020204 information systems, Expectation–maximization algorithm, FOS: Mathematics, 0202 electrical engineering, electronic engineering, information engineering, Cluster analysis, Algorithm, Mathematics
Abstract: In this study, we address the problem of clustering string data in an unsupervised manner by developing a theory of a mixture model and an EM algorithm for string data based on probability theory on a topological monoid of strings developed in our previous studies. We first construct a parametric distribution on a set of strings in the motif of the Laplace distribution on a set of real numbers and reveal its basic properties. This Laplace-like distribution has two parameters: a string that represents the location of the distribution and a positive real number that represents the dispersion. It is difficult to explicitly write maximum likelihood estimators of the parameters because their log likelihood function is a complex function, the variables of which include a string; however, we construct estimators that almost surely converge to the maximum likelihood estimators as the number of observed strings increases and demonstrate that the estimators strongly consistently estimate the parameters. Next, we develop an iteration algorithm for estimating the parameters of the mixture model of the Laplace-like distributions and demonstrate that the algorithm almost surely converges to the EM algorithm for the Laplace-like mixture and strongly consistently estimates its parameters as the numbers of observed strings and iterations increase. Finally, we derive a procedure for unsupervised string clustering from the Laplace-like mixture that is asymptotically optimal in the sense that the posterior probability of making correct classifications is maximized. Comment: 56 pages
Published: 2019

27. Measuring the Similarity of Proteomes using Grammar-based Compression via Domain Combinations

Author: Jose C. Nacher, Hitoshi Koyano, and Morihiro Hayashida
Subjects: Grammar, Similarity (network science), business.industry, Computer science, media_common.quotation_subject, Compression (functional analysis), Pattern recognition, Artificial intelligence, business, Domain (software engineering), media_common
Published: 2020

28. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework

Author: Abdelkader Baggag, Jinxiang Chen, Fuyi Li, Morihiro Hayashida, Ya Wen, Halima Bensmail, Zongyuan Ge, Yanwei Yue, and Jiangning Song
Subjects: 0303 health sciences, Sequence analysis, Computer science, 0206 medical engineering, Datasets as Topic, Reproducibility of Results, Promoter, 02 engineering and technology, Computational biology, Ensemble learning, Machine Learning, 03 medical and health sciences, Tree (data structure), Sigma factor, Genes, Bacterial, Benchmark (computing), Consensus sequence, Escherichia coli, Problem Solving Protocol, AdaBoost, Promoter Regions, Genetic, Molecular Biology, 020602 bioinformatics, 030304 developmental biology, Information Systems
Abstract: Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying both promoters and their respective classification. SELECTOR combines the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing ‘Black-box’ approaches that are unable to reveal causal relationships from large amounts of initially encoded features.
Published: 2019

29. Improving conditional random field model for prediction of protein-RNA residue-base contacts

Author: Noriyuki Okada, Hitoshi Koyano, Mayumi Kamada, and Morihiro Hayashida
Subjects: Conditional random field, Receiver operating characteristic, Stochastic modelling, Applied Mathematics, RNA, Overfitting, Biochemistry, Genetics and Molecular Biology (miscellaneous), Computer Science Applications, Modeling and Simulation, Likelihood function, CRFS, Algorithm, Random variable, Mathematics
Abstract: For understanding biological cellular systems, it is important to analyze interactions between protein residues and RNA bases. A method based on conditional random fields (CRFs) was developed for predicting contacts between residues and bases; it receives multiple sequence alignments for the given protein and RNA sequences, respectively, and learns a model with many parameters involved in relationships between neighboring residue-base pairs by maximizing the pseudo-likelihood function. In this paper, we propose a novel CRF-based model with more complicated dependency relationships between random variables than the previous model, but which uses fewer parameters in order to avoid overfitting to the training data. We performed cross-validation experiments to evaluate the proposed model, and took the average of the AUC (area under the receiver operating characteristic curve) scores. The result suggests that the proposed CRF-based model without L1-norm regularization (lasso) outperforms the existing model with and without the lasso under several input observations to CRFs. We proposed a novel stochastic model for predicting protein-RNA residue-base contacts, and improved the prediction accuracy in terms of the AUC score. This implies that more dependency relationships in a CRF can be controlled by fewer parameters.
Published: 2018

30. Bastion6: a bioinformatics approach for accurate prediction of type VI secreted effectors

Author: Kuo-Chen Chou, Richard A. Strugnell, Yanju Zhang, Tatiana T. Marquez-Lago, Tatsuya Akutsu, André Leier, Trevor Lithgow, Morihiro Hayashida, Jiangning Song, Andrea Rocker, Jiawei Wang, and Bingjiao Yang
Subjects: 0301 basic medicine, Statistics and Probability, Sequence analysis, Bacterial genome size, Bioinformatics, Biochemistry, Machine Learning, 03 medical and health sciences, Bacterial Proteins, Sequence Analysis, Protein, Gram-Negative Bacteria, Amino Acid Sequence, Molecular Biology, Peptide sequence, Internet, 030102 biochemistry & molecular biology, Ensemble forecasting, Effector, Supervised learning, Computational Biology, Sequence Analysis, DNA, Type VI Secretion Systems, Original Papers, Computer Science Applications, Support vector machine, Computational Mathematics, 030104 developmental biology, Computational Theory and Mathematics, Software, Function (biology)
Abstract: Motivation: Many Gram-negative bacteria use type VI secretion systems (T6SS) to export effector proteins into adjacent target cells. These secreted effectors (T6SEs) play vital roles in the competitive survival in bacterial populations, as well as pathogenesis of bacteria. Although various computational analyses have been previously applied to identify effectors secreted by certain bacterial species, there is no universal method available to accurately predict T6SS effector proteins from the growing tide of bacterial genome sequence data. Results: We extracted a wide range of features from T6SE protein sequences and comprehensively analyzed the prediction performance of these features through unsupervised and supervised learning. By integrating these features, we subsequently developed a two-layer SVM-based ensemble model with fine-grain optimized parameters, to identify potential T6SEs. We further validated the predictive model using an independent dataset, which showed that the proposed model achieved an impressive performance in terms of ACC (0.943), F-value (0.946), MCC (0.892) and AUC (0.976). To demonstrate applicability, we employed this method to correctly identify two very recently validated T6SE proteins, which represent challenging prediction targets because they significantly differed from previously known T6SEs in terms of their sequence similarity and cellular function. Furthermore, a genome-wide prediction across 12 bacterial species, involving in total 54 212 protein sequences, was carried out to distinguish 94 putative T6SE candidates. We envisage both this information and our publicly accessible web server will facilitate future discoveries of novel T6SEs. Availability and implementation: http://bastion6.erc.monash.edu/ Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2018
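The ACC, F-value, and MCC figures quoted in the Bastion6 abstract are standard confusion-matrix metrics. The sketch below shows the usual textbook formulas; the function name and the example counts are made up for illustration and do not reproduce the paper's evaluation. AUC is left out because it needs ranked prediction scores, not just counts.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, F-value (F1), and Matthews correlation coefficient."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    # MCC is conventionally reported as 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "F": f_value, "MCC": mcc}

# Hypothetical confusion-matrix counts, not taken from the paper.
print(classification_metrics(tp=45, fp=5, tn=40, fn=10))
```

MCC is often reported alongside ACC because it stays informative when the positive and negative classes are imbalanced.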

31. Euler String-Based Compression of Tree-Structured Data and its Application to Analysis of RNAs

Author: Morihiro Hayashida, Yang Zhao, Tatsuya Akutsu, Liwei Liu, and Tomoya Mori
Subjects: Computational Mathematics, symbols.namesake, Computer science, Tree structured data, Compression (functional analysis), String (computer science), Genetics, Euler's formula, symbols, Molecular Biology, Biochemistry, Algorithm
Published: 2018

32. Domain-Based Approaches to Prediction and Analysis of Protein-Protein Interactions

Author: Morihiro Hayashida and Tatsuya Akutsu
Abstract: Protein-protein interactions play various essential roles in cellular systems. Many methods have been developed for inference of protein-protein interactions from protein sequence data. In this paper, the authors focus on methods based on domain-domain interactions, where a domain is defined as a region within a protein that either performs a specific function or constitutes a stable structural unit. In these methods, the probabilities of domain-domain interactions are inferred from known protein-protein interaction data and protein domain data, and then prediction of interactions is performed based on these probabilities and contents of domains of given proteins. This paper overviews several fundamental methods, which include association method, expectation maximization-based method, support vector machine-based method, linear programming-based method, and conditional random field-based method. This paper also reviews a simple evolutionary model of protein domains, which yields a scale-free distribution of protein domains. By combining with a domain-based protein interaction model, a scale-free distribution of protein-protein interaction networks is also derived.
Published: 2019
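The association method mentioned in the abstract of record 32 is commonly described as scoring each domain pair by the fraction of protein pairs containing it that are known to interact, and then predicting a protein pair as interacting with probability one minus the product of (1 - score) over its domain pairs. The sketch below follows that common description under an independence assumption; the function names and toy data are hypothetical, self-interactions are ignored, and the chapter's exact formulation may differ.

```python
from itertools import combinations, product

def domain_pairs(domains_a, domains_b):
    """Unordered domain pairs formed between two proteins' domain sets."""
    return {tuple(sorted((m, n))) for m, n in product(domains_a, domains_b)}

def association_scores(protein_domains, interacting_pairs):
    """Score each domain pair by (interacting occurrences) / (all occurrences)."""
    interacting = {frozenset(p) for p in interacting_pairs}
    seen, hits = {}, {}
    # Only pairs of distinct proteins are considered in this sketch.
    for a, b in combinations(sorted(protein_domains), 2):
        for dp in domain_pairs(protein_domains[a], protein_domains[b]):
            seen[dp] = seen.get(dp, 0) + 1
            if frozenset((a, b)) in interacting:
                hits[dp] = hits.get(dp, 0) + 1
    return {dp: hits.get(dp, 0) / n for dp, n in seen.items()}

def interaction_probability(domains_a, domains_b, scores):
    """Probability that at least one domain pair interacts, assuming independence."""
    prob_none = 1.0
    for dp in domain_pairs(domains_a, domains_b):
        prob_none *= 1.0 - scores.get(dp, 0.0)
    return 1.0 - prob_none

# Toy data: four proteins, one known interaction.
protein_domains = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D1"}, "P4": {"D2"}}
scores = association_scores(protein_domains, [("P1", "P2")])
print(interaction_probability({"D1"}, {"D2"}, scores))
```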

33. Enumeration Method for Structural Isomers Containing User-Defined Structures Based on Breadth-First Search Approach

Author: Jira Jindalertudomdee, Tatsuya Akutsu, and Morihiro Hayashida
Subjects: 0301 basic medicine, Chemical substance, Chemistry, Pharmaceutical, Computation, Breadth-first search, Structure (category theory), Computational Biology, Chemical formula, 03 medical and health sciences, Computational Mathematics, Search engine, 030104 developmental biology, Computational Theory and Mathematics, Drug Design, Modeling and Simulation, Genetics, Enumeration, Molecular Biology, Algorithm, Algorithms, Mathematics, Generator (mathematics)
Abstract: Enumeration of chemical structures satisfying given conditions is an important step in the discovery of new compounds and drugs, as well as the elucidation of the structure. One of the most frequently used conditions in the enumeration is the number of chemical elements that corresponds to the chemical formula. In this work, we propose a novel efficient enumeration algorithm, BfsStructEnum, which allows users to define desired cyclic structures and enumerates all nonredundant chemical compounds containing only defined structures as cyclic structures from a given chemical formula. To evaluate the performance, we confirm the number of enumerated structures of BfsStructEnum and MOLGEN 5.0, the latest version of a general-purpose structure generator. We also compare the computation time of BfsStructEnum with that of MOLGEN 5.0. The findings show that, given the same number of enumerated structures as MOLGEN 5.0, BfsStructEnum is significantly faster. By compressing a cyclic structure into a single node and representing chemical compounds by tree structures instead of normal graphs, the enumeration can be executed more efficiently.
Published: 2016

34. Complex network-based approaches to biomarker discovery

Author: Tatsuya Akutsu and Morihiro Hayashida
Subjects: 0301 basic medicine, Biochemistry (medical), Clinical Biochemistry, Computational biology, Biology, Complex network, Bioinformatics, Models, Biological, 03 medical and health sciences, 030104 developmental biology, Tissue Array Analysis, Expression data, Drug Discovery, Humans, Protein Interaction Maps, Observability, Biomarker discovery, Dynamical network, Centrality, Algorithms, Biomarkers, Protein Interaction Map
Abstract: Many studies on biomarker discovery have been done by analyzing mutations in DNA sequences and differences in gene expression patterns. As a new branch of the latter approach, the concept of network biomarkers has been proposed, in which expression data of small subnetworks are used as markers. Furthermore, network biomarkers have been extended to dynamical network biomarkers, in which time series expression data of subnetworks are used as markers. On the other hand, the methodologies in complex networks have also been applied to biomarker discovery. For example, various centrality measures and the concept of observability have been applied. In this article, we review these new approaches for biomarker discovery, focusing on the computational/methodological aspects.
Published: 2016

35. Bastion3: a two-layer ensemble predictor of type III secreted effectors

Author: Tatiana T. Marquez-Lago, Jiawei Wang, Trevor Lithgow, Kuo-Chen Chou, Joel Selkrig, Jiahui Li, Jiangning Song, André Leier, Tatsuya Akutsu, Yanju Zhang, Bingjiao Yang, Tieli Zhou, Ruopeng Xie, and Morihiro Hayashida
Subjects: Statistics and Probability, Gram-negative bacteria, Computer science, Value (computer science), Machine learning, computer.software_genre, Biochemistry, Bacterial protein, Machine Learning, 03 medical and health sciences, Protein sequencing, Bacterial Proteins, Genetic algorithm, Gram-Negative Bacteria, Secretion, Amino Acid Sequence, Molecular Biology, Peptide sequence, 030304 developmental biology, 0303 health sciences, biology, business.industry, 030302 biochemistry & molecular biology, Computational Biology, biology.organism_classification, Ensemble learning, Original Papers, Computer Science Applications, Computational Mathematics, Identification (information), Computational Theory and Mathematics, Host cell cytoplasm, Benchmark (computing), Artificial intelligence, business, computer, Algorithms, Software
Abstract: Motivation: Type III secreted effectors (T3SEs) can be injected into host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Due to their relevance in pathogen–host interactions, significant computational efforts have been put toward identification of T3SEs and these in turn have stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, these existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (or incorporating also the C-terminus) instead of the proteins’ complete sequences, and (ii) the underlying models (trained with classic algorithms) employed only few features, most of which were extracted based on sequence-information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to effectively integrate these into a comprehensive model. Results: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features, from various types, trains single models based on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM and further boosted the models’ performances through a novel genetic algorithm (GA) based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 achieves a much better performance compared to commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, which comprehensively outperformed all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value. Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit, maximizing convenience for experimental scientists toward T3SE prediction. With its design to ease future discoveries of novel T3SEs and improved performance, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction. Availability and implementation: http://bastion3.erc.monash.edu/ Contact: selkrig@embl.de or wyztli@163.com or trevor.lithgow@monash.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Published: 2018

36. ncRNA-disease association prediction based on sequence information and tripartite network

Author: Jose C. Nacher, Hayliang Ngouv, Morihiro Hayashida, Tatsuya Akutsu, and Takuya Mori
Subjects: 0301 basic medicine, RNA, Untranslated, Computer science, Systems biology, Disease Association, Computational biology, Correlation, 03 medical and health sciences, Structural Biology, Neoplasms, Databases, Genetic, Humans, Disease, Resource allocation, lcsh:QH301-705.5, Molecular Biology, Sequence, Applied Mathematics, Research, ncRNA-disease association predictions, Computational Biology, Non-coding RNA, Computer Science Applications, 030104 developmental biology, lcsh:Biology (General), Modeling and Simulation, Kernel (statistics), Mutation (genetic algorithm), Algorithms, Tripartite network
Abstract: Background Current technology has demonstrated that mutation and deregulation of non-coding RNAs (ncRNAs) are associated with diverse human diseases and important biological processes. Therefore, developing a novel computational method for predicting potential ncRNA-disease associations could benefit pathologists in understanding the correlation between ncRNAs and disease diagnosis, treatment, and prevention. However, only a few studies have investigated these associations in pathogenesis. Results This study utilizes a disease-target-ncRNA tripartite network, and computes prediction scores between each disease-ncRNA pair by integrating biological information derived from pairwise similarity based upon sequence expressions with weights obtained from a multi-layer resource allocation technique. Our proposed algorithm was evaluated based on a 5-fold-cross-validation with optimal kernel parameter tuning. In addition, we achieved an average AUC that varies from 0.75 without link cut to 0.57 with link cut methods, which outperforms a previous method using the same evaluation methodology. Furthermore, the algorithm predicted 23 ncRNA-disease associations supported by other independent biological experimental studies. Conclusions Taken together, these results demonstrate the capability and accuracy of predicting further biological significant associations between ncRNAs and diseases and highlight the importance of adding biological sequence information to enhance predictions.
Published: 2018

37. Determining the minimum number of protein-protein interactions required to support known protein complexes

Author: Osamu Maruyama, Tatsuya Akutsu, Natsu Nakajima, Jesper Jansson, and Morihiro Hayashida
Subjects: 0301 basic medicine, Proteomics, Computer and Information Sciences, Theoretical computer science, Linear programming, Computer science, lcsh:Medicine, Context (language use), Research and Analysis Methods, Biochemistry, Protein–protein interaction, 03 medical and health sciences, Database and Informatics Methods, 0302 clinical medicine, Mathematical and Statistical Techniques, Protein Interaction Mapping, Computer Simulation, Statistical Methods, Linear Programming, lcsh:Science, Protein Interactions, Integer programming, Connected component, Multidisciplinary, Applied Mathematics, Simulation and Modeling, String (computer science), lcsh:R, Approximation algorithm, Biology and Life Sciences, Proteins, Protein Complexes, Computational Biology, Constraint (information theory), 030104 developmental biology, Protein-Protein Interactions, Proteins metabolism, Physical Sciences, lcsh:Q, Protein Interaction Networks, Mathematical Functions, 030217 neurology & neurosurgery, Mathematics, Statistics (Mathematics), Algorithms, Network Analysis, Research Article, Forecasting
Abstract: The prediction of protein complexes from protein-protein interactions (PPIs) is a well-studied problem in bioinformatics. However, the currently available PPI data is not enough to describe all known protein complexes. In this paper, we express the problem of determining the minimum number of (additional) required protein-protein interactions as a graph theoretic problem under the constraint that each complex constitutes a connected component in a PPI network. For this problem, we develop two computational methods: one is based on integer linear programming (ILPMinPPI) and the other one is based on an existing greedy-type approximation algorithm (GreedyMinPPI) originally developed in the context of communication and social networks. Since the former method is only applicable to datasets of small size, we apply the latter method to a combination of the CYC2008 protein complex dataset and each of eight PPI datasets (STRING, MINT, BioGRID, IntAct, DIP, BIND, WI-PHI, iRefIndex). The results show that the minimum number of additional required PPIs ranges from 51 (STRING) to 964 (BIND), and that even the four best PPI databases, STRING (51), BioGRID (67), WI-PHI (93) and iRefIndex (85), do not include enough PPIs to form all CYC2008 protein complexes. We also demonstrate that the proposed problem framework and our solutions can enhance the prediction accuracy of existing PPI prediction methods. ILPMinPPI can be freely downloaded from http://sunflower.kuicr.kyoto-u.ac.jp/~nakajima/.
Published: 2018
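
As a rough illustration of the connectivity constraint in entry 37 (this is not the authors' ILPMinPPI or GreedyMinPPI), the sketch below counts, for each complex taken on its own, how many connected components its members form in the PPI graph. A complex split into c components needs c - 1 extra interactions. Summing these gives only a naive upper bound, because one added interaction can serve several overlapping complexes. The function names and the toy data are hypothetical.

```python
def count_components(members, edges):
    """Count connected components of the subgraph induced by `members`."""
    members = set(members)
    parent = {p: p for p in members}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        # Only interactions with both endpoints inside the complex count.
        if u in members and v in members:
            parent[find(u)] = find(v)
    return len({find(p) for p in members})


def naive_extra_ppis(complexes, edges):
    """Upper bound: sum over complexes of (components - 1)."""
    return sum(max(0, count_components(c, edges) - 1) for c in complexes)


# Hypothetical toy data, not taken from CYC2008 or any PPI database.
ppis = [("A", "B"), ("C", "D")]
complexes = [["A", "B", "C"], ["C", "D"], ["E", "F", "G"]]
extra = naive_extra_ppis(complexes, ppis)
```

Here the first complex needs one extra interaction, the second none, and the third two.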

38. Improving prediction of heterodimeric protein complexes using combination with pairwise kernel

Author: Morihiro Hayashida, Jean-Philippe Vert, Tatsuya Akutsu, Peiying Ruan, National Institute of Advanced Industrial Science and Technology (AIST), Matsue National College of Technology, Bioinformatics Center (KEGG), Kyoto University, Centre de Bioinformatique (CBIO), Mines Paris - PSL (École nationale supérieure des mines de Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL), Cancer et génome: Bioinformatique, biostatistiques et épidémiologie d'un système complexe, Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut Curie [Paris]-Institut National de la Santé et de la Recherche Médicale (INSERM), Département de Mathématiques et Applications - ENS Paris (DMA), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS), Vert, Jean-Philippe, Matsue College, Kyoto University [Kyoto], MINES ParisTech - École nationale supérieure des mines de Paris, Institut Curie [Paris]-MINES ParisTech - École nationale supérieure des mines de Paris, Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de la Santé et de la Recherche Médicale (INSERM), Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), and École normale supérieure - Paris (ENS Paris)
Subjects: 0301 basic medicine, Normalization (statistics), Support Vector Machine, genetic structures, Computer science, Protein domain, information science, lcsh:Computer applications to medicine. Medical informatics, Biochemistry, 03 medical and health sciences, Kernel (linear algebra), Protein Domains, Structural Biology, Combination kernel, natural sciences, Protein Interaction Maps, Heterodimeric protein complex, Molecular Biology, lcsh:QH301-705.5, Phylogeny, ComputingMilieux_MISCELLANEOUS, Pairwise kernel, [SDV.BIBS] Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], 030102 biochemistry & molecular biology, Research, Applied Mathematics, food and beverages, [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], Computer Science Applications, Support vector machine, 030104 developmental biology, Tensor product, lcsh:Biology (General), Multiprotein Complexes, Kernel (statistics), lcsh:R858-859.7, Pairwise comparison, Protein Multimerization, DNA microarray, Biological system, Dimerization, Algorithms
Abstract: Background Since many proteins become functional only after they interact with their partner proteins and form protein complexes, it is essential to identify the sets of proteins that form complexes. Therefore, several computational methods have been proposed to predict complexes from the topology and structure of experimental protein-protein interaction (PPI) networks. These methods work well to predict complexes involving at least three proteins, but generally fail at identifying complexes involving only two different proteins, called heterodimeric complexes or heterodimers. There is however an urgent need for efficient methods to predict heterodimers, since the majority of known protein complexes are precisely heterodimers. Results In this paper, we use three promising kernel functions, Min kernel and two pairwise kernels, which are Metric Learning Pairwise Kernel (MLPK) and Tensor Product Pairwise Kernel (TPPK). We also consider the normalization forms of Min kernel. Then, we combine Min kernel or its normalization form and one of the pairwise kernels by plugging. We applied kernels based on PPI, domain, phylogenetic profile, and subcellular localization properties to predicting heterodimers. Then, we evaluate our method by employing C-Support Vector Classification (C-SVC), carrying out 10-fold cross-validation, and calculating the average F-measures. The results suggest that the combination of normalized-Min-kernel and MLPK leads to the best F-measure and improved the performance of our previous work, which had been the best existing method so far. Conclusions We propose new methods to predict heterodimers, using a machine learning-based approach. We train a support vector machine (SVM) to discriminate interacting vs non-interacting protein pairs, based on information extracted from PPI, domain, phylogenetic profiles and subcellular localization. We evaluate in detail new kernel functions to encode these data, and report prediction performance that outperforms the state-of-the-art.
Published: 2018
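
The kernels named in entry 38 have well-known closed forms, sketched below for orientation. Given a base kernel K on single objects, TPPK scores two pairs (a, b) and (c, d) as K(a,c)K(b,d) + K(a,d)K(b,c), and MLPK as (K(a,c) - K(a,d) - K(b,c) + K(b,d))^2; the Min kernel sums element-wise minima of non-negative feature vectors. Using the Min kernel as the base kernel is only one possible reading of "plugging" and is an assumption here, as are the feature vectors.

```python
import math


def min_kernel(x, y):
    """Min kernel: sum of element-wise minima of non-negative vectors."""
    return sum(min(a, b) for a, b in zip(x, y))


def normalized_min_kernel(x, y):
    """Normalized form K(x, y) / sqrt(K(x, x) * K(y, y))."""
    denom = math.sqrt(min_kernel(x, x) * min_kernel(y, y))
    return min_kernel(x, y) / denom if denom > 0 else 0.0


def tppk(k, a, b, c, d):
    """Tensor Product Pairwise Kernel between pairs (a, b) and (c, d)."""
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)


def mlpk(k, a, b, c, d):
    """Metric Learning Pairwise Kernel between pairs (a, b) and (c, d)."""
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2


# Made-up feature vectors for four proteins.
p1, p2 = [1.0, 0.0, 2.0], [0.0, 1.0, 1.0]
p3, p4 = [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]
tppk_value = tppk(min_kernel, p1, p2, p3, p4)
mlpk_value = mlpk(min_kernel, p1, p2, p3, p4)
```

Both pairwise kernels are invariant to swapping the two proteins within a pair, which is the property that makes them suitable for unordered protein pairs.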

39. Grammar-based Compression for Directed and Undirected Generalized Series-parallel Graphs using Integer Linear Programming

Author: Hitoshi Koyano, Morihiro Hayashida, and Tatsuya Akutsu
Subjects: Discrete mathematics, Grammar, Computer science, media_common.quotation_subject, Compression (functional analysis), Series and parallel circuits, Integer programming, media_common
Published: 2018

40. Critical evaluation of in silico methods for prediction of coiled-coil domains in proteins

Author: Chen Li, Jeremy Nagel, Catherine Ching Han Chang, Ashley M. Buckle, Jiangning Song, Morihiro Hayashida, Benjamin T. Porebski, and Tatsuya Akutsu
Subjects: Models, Molecular, 0301 basic medicine, Coiled coil, Protein Conformation, Computer science, In silico, Proteins, Computational biology, 03 medical and health sciences, Validation methods, 030104 developmental biology, Models, Chemical, Protein Domains, Sequence Analysis, Protein, Papers, State prediction, Computer Simulation, Dimerization, Molecular Biology, Algorithms, Software, Information Systems
Abstract: Coiled-coils refer to a bundle of helices coiled together like strands of a rope. It has been estimated that nearly 3% of protein-encoding regions of genes harbour coiled-coil domains (CCDs). Experimental studies have confirmed that CCDs play a fundamental role in subcellular infrastructure and controlling trafficking of eukaryotic cells. Given the importance of coiled-coils, multiple bioinformatics tools have been developed to facilitate the systematic and high-throughput prediction of CCDs in proteins. In this article, we review and compare 12 sequence-based bioinformatics approaches and tools for coiled-coil prediction. These approaches can be categorized into two classes: coiled-coil detection and coiled-coil oligomeric state prediction. We evaluated and compared these methods in terms of their input/output, algorithm, prediction performance, validation methods and software utility. All the independent testing data sets are available at http://lightning.med.monash.edu/coiledcoil/. In addition, we conducted a case study of nine human polyglutamine (PolyQ) disease-related proteins and predicted CCDs and oligomeric states using various predictors. Prediction results for CCDs were highly variable among different predictors. Only two peptides from two proteins were confirmed to be CCDs by majority voting. Both domains were predicted to form dimeric coiled-coils using oligomeric state prediction. We anticipate that this comprehensive analysis will be an insightful resource for structural biologists with limited prior experience in bioinformatics tools, and for bioinformaticians who are interested in designing novel approaches for coiled-coil and its oligomeric state prediction.
Published: 2015

41. Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches

Author: Tatiana T. Marquez-Lago, Morihiro Hayashida, Yang Zhang, Jiawei Wang, André Leier, Geoffrey I. Webb, Jonathan J. Wilksch, Richard A. Strugnell, Yi An, Tatsuya Akutsu, Trevor Lithgow, Jiangning Song, Qingyang Hong, and Bingjiao Yang
Subjects: 0301 basic medicine, Paper, Support Vector Machine, Biology, Machine learning, computer.software_genre, Machine Learning, 03 medical and health sciences, Naive Bayes classifier, Bayes' theorem, Bacterial Proteins, Molecular Biology, Bacterial Secretion Systems, 030102 biochemistry & molecular biology, Artificial neural network, Effector, business.industry, Bayes Theorem, Ensemble learning, Random forest, Support vector machine, 030104 developmental biology, Multilayer perceptron, Artificial intelligence, business, computer, Algorithms, Information Systems
Abstract: In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naive Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhance the predictive performance of T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the 'majority voting strategy' led to a more stable and accurate classification performance when applied to predicting an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.
Published: 2017

42. SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems

Author: Yi An, Jiangning Song, Tatsuya Akutsu, Geoffrey I. Webb, Jiawei Wang, Yang Zhang, Jerico Revote, Thomas Naderer, Chen Li, Trevor Lithgow, and Morihiro Hayashida
Subjects: 0301 basic medicine, Databases, Factual, Virulence Factors, Virulence, Biology, Article, Protein Structure, Secondary, Type three secretion system, Evolution, Molecular, Type IV Secretion Systems, 03 medical and health sciences, Protein structure, Bacterial Proteins, Type III Secretion Systems, Secretion, Type VI secretion system, Internet, Multidisciplinary, Bacteria, Effector, Type VI Secretion Systems, Cell biology, Metabolic pathway, 030104 developmental biology, Host-Pathogen Interactions, Function (biology)
Abstract: Bacteria translocate effector molecules to host cells through highly evolved secretion systems. By definition, the function of these effector proteins is to manipulate host cell biology and the sequence, structural and functional annotations of these effector proteins will provide a better understanding of how bacterial secretion systems promote bacterial survival and virulence. Here we developed a knowledgebase, termed SecretEPDB (Bacterial Secreted Effector Protein DataBase), for effector proteins of type III secretion system (T3SS), type IV secretion system (T4SS) and type VI secretion system (T6SS). SecretEPDB provides enriched annotations of the aforementioned three classes of effector proteins by manually extracting and integrating structural and functional information from currently available databases and the literature. The database is conservative and strictly curated to ensure that every effector protein entry is supported by experimental evidence that demonstrates it is secreted by a T3SS, T4SS or T6SS. The annotations of effector proteins documented in SecretEPDB are provided in terms of protein characteristics, protein function, protein secondary structure, Pfam domains, metabolic pathway and evolutionary details. It is our hope that this integrated knowledgebase will serve as a useful resource for biological investigation and the generation of new hypotheses for research efforts aimed at bacterial secretion systems.
Published: 2017

43. Finding Median and Center Strings for a Probability Distribution on a Set of Strings Under Levenshtein Distance Based on Integer Linear Programming

Author: Morihiro Hayashida and Hitoshi Koyano
Subjects: Discrete mathematics, Levenshtein automaton, business.industry, Pattern recognition, 02 engineering and technology, Approximate string matching, Levenshtein distance, 01 natural sciences, Set (abstract data type), High Energy Physics::Theory, Damerau–Levenshtein distance, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Probability distribution, 020201 artificial intelligence & image processing, Edit distance, Jaro–Winkler distance, Artificial intelligence, 010306 general physics, business, Mathematics
Abstract: For a data set composed of numbers or numerical vectors, a mean is the most fundamental measure for capturing the center of the data. However, for a data set of strings, a mean of the data cannot be defined, and therefore, median and center strings are frequently used as a measure of the center of the data. In contrast to calculating a mean of numerical data, constructing median and center strings of string data is not easy, and no algorithm is found that is guaranteed to construct exact solutions of center strings. In this study, we first generalize the definitions of median and center strings of string data into those of a probability distribution on a set of all strings composed of letters in a given alphabet. This generalization corresponds to that of a mean of numerical data into an expected value of a probability distribution on a set of numbers or numerical vectors. Next, we develop methods for constructing exact solutions of median and center strings for a probability distribution on a set of strings, applying integer linear programming. These methods are improved into faster ones by using the triangle inequality on the Levenshtein distance in the case where a set of strings is a metric space with the Levenshtein distance. Furthermore, we also develop methods for constructing approximate solutions of median and center strings very rapidly if the probability of a subset composed of similar strings is close to one. Lastly, we perform simulation experiments to examine the usefulness of our proposed methods in practical applications.
Published: 2017
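
To make the definitions in entry 43 concrete, the sketch below computes the Levenshtein distance with the usual dynamic program, then picks a median string (minimum expected distance) and a center string (minimum worst-case distance) for a probability distribution over strings. It restricts candidates to the strings that appear in the distribution, which is a simplifying assumption; the integer linear programming methods of the paper are not reproduced here. The distribution is invented.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def set_median(dist):
    """Candidate from the support minimizing the expected distance."""
    return min(sorted(dist),
               key=lambda c: sum(p * levenshtein(c, s)
                                 for s, p in dist.items()))


def set_center(dist):
    """Candidate from the support minimizing the maximum distance."""
    return min(sorted(dist),
               key=lambda c: max(levenshtein(c, s)
                                 for s, p in dist.items() if p > 0))


# Invented distribution over three strings.
dist = {"kitten": 0.6, "sitten": 0.3, "sitting": 0.1}
median = set_median(dist)
center = set_center(dist)
```

With these probabilities the median is the most likely string, while the center is the string lying between the other two, so the two notions can disagree.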

44. Proteome compression via protein domain compositions

Author: Morihiro Hayashida, Peiying Ruan, and Tatsuya Akutsu
Subjects: Proteomics, Genetics, Proteome, biology, Saccharomyces cerevisiae, Protein domain, Computational biology, biology.organism_classification, Grammar-based compression, General Biochemistry, Genetics and Molecular Biology, Dictyostelium discoideum, Protein Structure, Tertiary, Protein domain composition, Integer linear programming, Sequence Analysis, Protein, Gene duplication, Schizosaccharomyces pombe, Animals, Humans, Drosophila melanogaster, Databases, Protein, Molecular Biology, Algorithms, Caenorhabditis elegans
Abstract: In this paper, we study domain compositions of proteins via compression of whole proteins in an organism for the sake of obtaining the entropy that the individual contains. We suppose that a protein is a multiset of domains. Since gene duplication and fusion have occurred through evolutionary processes, the same domains and the same compositions of domains appear in multiple proteins, which enables us to compress a proteome by using references to proteins for duplicated and fused proteins. Such a network with references to at most two proteins is modeled as a directed hypergraph. We propose a heuristic approach by combining the Edmonds algorithm and an integer linear programming, and apply our procedure to 14 proteomes of Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Oryza sativa, Danio rerio, Xenopus laevis, Gallus gallus, Mus musculus, Pan troglodytes, and Homo sapiens. The compressed size using both of duplication and fusion was smaller than that using only duplication, which suggests the importance of fusion events in evolution of a proteome.
Published: 2014

45. Domain-Based Approaches to Prediction and Analysis of Protein-Protein Interactions

Author: Morihiro Hayashida and Tatsuya Akutsu
Subjects: Support vector machine, Conditional random field, Protein sequencing, Protein domain, Expectation–maximization algorithm, Interaction model, Data mining, Biology, computer.software_genre, Biological system, computer, Protein–protein interaction, Domain (software engineering)
Abstract: Protein-protein interactions play various essential roles in cellular systems. Many methods have been developed for inference of protein-protein interactions from protein sequence data. In this paper, the authors focus on methods based on domain-domain interactions, where a domain is defined as a region within a protein that either performs a specific function or constitutes a stable structural unit. In these methods, the probabilities of domain-domain interactions are inferred from known protein-protein interaction data and protein domain data, and then prediction of interactions is performed based on these probabilities and contents of domains of given proteins. This paper overviews several fundamental methods, which include association method, expectation maximization-based method, support vector machine-based method, linear programming-based method, and conditional random field-based method. This paper also reviews a simple evolutionary model of protein domains, which yields a scale-free distribution of protein domains. By combining with a domain-based protein interaction model, a scale-free distribution of protein-protein interaction networks is also derived.
Published: 2014
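
The association method mentioned in entry 45 can be sketched in a few lines, following its commonly described form: each domain pair is scored by the fraction of protein pairs containing it that are known to interact, and a protein pair is then predicted to interact with probability 1 minus the product of (1 - score) over its domain pairs. The protein names, domain assignments and interaction labels below are hypothetical.

```python
from itertools import product


def domain_pairs(doms_a, doms_b):
    """Unordered domain pairs formed between two proteins."""
    return {tuple(sorted(p)) for p in product(doms_a, doms_b)}


def association_scores(domains, labeled_pairs):
    """Score each domain pair as (interacting pairs) / (all pairs) containing it."""
    seen, hits = {}, {}
    for (a, b), interacts in labeled_pairs.items():
        for dp in domain_pairs(domains[a], domains[b]):
            seen[dp] = seen.get(dp, 0) + 1
            hits[dp] = hits.get(dp, 0) + int(interacts)
    return {dp: hits[dp] / seen[dp] for dp in seen}


def interaction_probability(doms_a, doms_b, scores):
    """Probability that at least one domain pair mediates an interaction."""
    prob_none = 1.0
    for dp in domain_pairs(doms_a, doms_b):
        prob_none *= 1.0 - scores.get(dp, 0.0)
    return 1.0 - prob_none


# Hypothetical training data.
domains = {"P1": {"d1"}, "P2": {"d2"}, "P3": {"d1", "d3"}, "P4": {"d2"}}
labeled = {("P1", "P2"): True, ("P3", "P4"): False, ("P1", "P4"): True}
scores = association_scores(domains, labeled)
prob = interaction_probability(domains["P3"], domains["P4"], scores)
```

Domain pairs never seen in training default to a score of 0 here, which is a design choice of this sketch rather than part of the method's definition.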

46. Prediction of Protein-Protein Interaction Strength Using Domain Features with Supervised Regression

Author: Morihiro Hayashida, Yusuke Sakuma, Tatsuya Akutsu, and Mayumi Kamada
Subjects: Article Subject, Computer science, Feature vector, Protein domain, lcsh:Medicine, Machine learning, computer.software_genre, lcsh:Technology, General Biochemistry, Genetics and Molecular Biology, Domain (software engineering), Protein–protein interaction, Relevance vector machine, Search engine, Artificial Intelligence, Protein Interaction Mapping, Protein Interaction Domains and Motifs, lcsh:Science, General Environmental Science, lcsh:T, business.industry, lcsh:R, Computational Biology, General Medicine, Regression, Support vector machine, lcsh:Q, Artificial intelligence, business, computer, Research Article
Abstract: Proteins in living organisms express various important functions by interacting with other proteins and molecules. Therefore, many efforts have been made to investigate and predict protein-protein interactions (PPIs). Analysis of strengths of PPIs is also important because such strengths are involved in functionality of proteins. In this paper, we propose several feature space mappings from protein pairs using protein domain information to predict strengths of PPIs. Moreover, we perform computational experiments employing two machine learning methods, support vector regression (SVR) and relevance vector machine (RVM), for dataset obtained from biological experiments. The prediction results showed that both SVR and RVM with our proposed features outperformed the best existing method.
Published: 2014

47. Convolutional neural network approach to lung cancer classification integrating protein interaction network and gene expression profiles

Author: Jose C. Nacher, Tomoshiro Ochiai, Morihiro Hayashida, Teppei Matsubara, and Tatsuya Akutsu
Subjects: 0301 basic medicine, Lung Neoplasms, Support Vector Machine, Computer science, Systems biology, 0206 medical engineering, 02 engineering and technology, Machine learning, computer.software_genre, Biochemistry, Convolutional neural network, Field (computer science), Machine Learning, Random Allocation, 03 medical and health sciences, 0302 clinical medicine, Interaction network, Cluster Analysis, Humans, Protein Interaction Maps, Molecular Biology, 030304 developmental biology, 0303 health sciences, Artificial neural network, business.industry, Deep learning, Reproducibility of Results, Pattern recognition, Complex network, Spectral clustering, Computer Science Applications, Random forest, Support vector machine, 030104 developmental biology, ComputingMethodologies_PATTERNRECOGNITION, 030220 oncology & carcinogenesis, Neural Networks, Computer, Artificial intelligence, Transcriptome, business, computer, Algorithms, 020602 bioinformatics
Abstract: Deep learning technologies are permeating every field from image and speech recognition to computational and systems biology. However, the application of convolutional neural networks (CNNs) to “omics” data poses some difficulties, such as the processing of complex network structures as well as their integration with transcriptome data. Here, we propose a CNN approach that combines spectral clustering information processing to classify lung cancer. The developed spectral-convolutional neural network based method achieves success in integrating protein interaction network data and gene expression profiles to classify lung cancer. The performed computational experiments suggest that in terms of accuracy the predictive performance of our proposed method was better than those of other machine learning methods such as SVM or Random Forest. Moreover, the computational results also indicate that the underlying protein network structure helps to enhance the predictions. Data and CNN code can be downloaded from the link: https://sites.google.com/site/nacherlab/analysis
Published: 2019

48. LBSizeCleav: improved support vector machine (SVM)-based prediction of Dicer cleavage sites using loop/bulge length

Author: Tatsuya Akutsu, Yu Bao, and Morihiro Hayashida
Subjects: 0301 basic medicine, Ribonuclease III, Support Vector Machine, Feature vector, 0206 medical engineering, Improved method, 02 engineering and technology, Computational biology, Biology, Cleavage (embryo), computer.software_genre, Biochemistry, DEAD-box RNA Helicases, 03 medical and health sciences, Naive Bayes classifier, Structural Biology, Bulge, RNA Precursors, Humans, Loop/bulge length, Molecular Biology, Base Sequence, Applied Mathematics, Bayes Theorem, Random forest, Computer Science Applications, Support vector machine, MicroRNAs, 030104 developmental biology, biology.protein, Nucleic Acid Conformation, Data mining, Dicer cleavage site, computer, 020602 bioinformatics, Algorithms, Software, Dicer, Research Article
Abstract: Background: Dicer is necessary for the process of mature microRNA (miRNA) formation because the Dicer enzyme cleaves pre-miRNA correctly to generate miRNA with correct seed regions. Nonetheless, the mechanism underlying the selection of a Dicer cleavage site is still not fully understood. To date, several studies have been conducted to solve this problem, for example, a recent discovery indicates that the loop/bulge structure plays a central role in the selection of Dicer cleavage sites. In accordance with this breakthrough, a support vector machine (SVM)-based method called PHDCleav was developed to predict Dicer cleavage sites which outperforms other methods based on random forest and naive Bayes. PHDCleav, however, tests only whether a position in the shift window belongs to a loop/bulge structure. Result: In this paper, we used the length of loop/bulge structures (in addition to their presence or absence) to develop an improved method, LBSizeCleav, for predicting Dicer cleavage sites. To evaluate our method, we used 810 empirically validated sequences of human pre-miRNAs and performed fivefold cross-validation. In both 5p and 3p arms of pre-miRNAs, LBSizeCleav showed greater prediction accuracy than PHDCleav did. This result suggests that the length of loop/bulge structures is useful for prediction of Dicer cleavage sites. Conclusion: We developed a novel algorithm for feature space mapping based on the length of a loop/bulge for predicting Dicer cleavage sites. The better performance of our method indicates the usefulness of the length of loop/bulge structures for such predictions.
Published: 2016

49. Host-Pathogen Protein Interaction Prediction Based on Local Topology Structures of a Protein Interaction Network

Author: Morihiro Hayashida, Jiangning Song, Jira Jindalertudomdee, and Tatsuya Akutsu
Subjects: 0301 basic medicine, Mechanism (biology), Host (biology), Local topology, A protein, Computational biology, Biology, Cell biology, Biological pathway, 03 medical and health sciences, 030104 developmental biology, Interaction network, Pathogen, Function (biology)
Abstract: Understanding how a pathogen's proteins interact with its host's proteins is key to understanding the pathogen's infection mechanism, which can lead to the discovery of improved therapeutics for treating infectious diseases. Several studies suggest that proteins from various pathogens tend to interact with human proteins involved in the same biological pathway. This implies that pathogens are inclined to target host proteins with similar function. In addition, conservation between a protein's function and its local topological structure in a protein-protein interaction network (PIN) has been previously characterized. This leads to the hypothesis that pathogens target host proteins with a similar local topological structure in a PIN. In this work, this hypothesis is examined by adding a graphlet degree vector of a protein in the human PIN as a feature in the prediction model and using that model to predict the protein-protein interactions between human and four pathogens. The results show that this graphlet degree vector increases the performance significantly for all pathogens. This suggests that intraspecies protein-protein interactions should be taken into consideration when developing prediction methods for host-pathogen protein interaction. The results also support the hypothesis that there exists a relationship between a protein's function and the local topology of the PIN.
Published: 2016

50. Maximum margin classifier working in a set of strings

Author: Morihiro Hayashida, Hitoshi Koyano, and Tatsuya Akutsu
Subjects: 0301 basic medicine, FOS: Computer and information sciences, General Mathematics, 0206 medical engineering, General Engineering, General Physics and Astronomy, Machine Learning (stat.ML), 02 engineering and technology, 62G20, 68Q32, 86W32, Support vector machine, 03 medical and health sciences, Search engine, 030104 developmental biology, Asymptotically optimal algorithm, Probability theory, String kernel, Statistics - Machine Learning, Margin classifier, Classifier (UML), Algorithm, 020602 bioinformatics, Research Articles, Vector space, Mathematics
Abstract: Numbers and numerical vectors account for a large portion of data. However, recently the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem on the asymptotic behavior of a consensus sequence of strings, which is the counterpart of the mean of numerical vectors, as demonstrated in the probability theory on a metric space of strings developed by one of the authors and his colleague in a previous study [18]. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences. This manuscript has been withdrawn because the experiments in Section 6 are insufficient.
Published: 2016
