69 results for "Koji Tsuda"
Search Results
2. Selective Inference for High-order Interaction Features Selected in a Stepwise Manner
- Author
- Yuta Umezu, Ichiro Takeuchi, Shinya Suzumura, Kazuya Nakagawa, and Koji Tsuda
- Subjects
Computer science, Inference, Pattern recognition, Artificial intelligence, High-order interactions, Biochemistry, Genetics and Molecular Biology (miscellaneous), Computer Science Applications
- Published
- 2021
3. Pushing property limits in materials discovery via boundless objective-free exploration
- Author
- Shinsuke Ishihara, Masato Sumita, Daniel T. Payne, Ryo Tamura, Kei Terayama, Mandeep K. Chahal, and Koji Tsuda
- Subjects
Computer science, Kernel methods, Materials discovery, Biochemical engineering, Repurposing, General Chemistry
- Abstract
Materials chemists develop chemical compounds to meet often conflicting demands of industrial applications. This process may not be properly modeled by black-box optimization because the target property is not well defined in some cases. Herein, we propose a new algorithm for automated materials discovery called BoundLess Objective-free eXploration (BLOX) that uses a novel criterion based on kernel-based Stein discrepancy in the property space. Unlike other objective-free exploration methods, a boundary for the materials properties is not needed; hence, BLOX is suitable for open-ended scientific endeavors. We demonstrate the effectiveness of BLOX by finding light-absorbing molecules from a drug database. Our goal is to minimize the number of density functional theory calculations required to discover out-of-trend compounds in the intensity-wavelength property space. Using absorption spectroscopy, we experimentally verified that eight compounds identified as outstanding exhibit the expected optical properties. Our results show that BLOX is useful for chemical repurposing, and we expect this search method to have numerous applications in various scientific disciplines.
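The selection criterion in BLOX can be illustrated with a small sketch. The paper uses a kernel-based Stein discrepancy; as a simplified stand-in, the sketch below scores each candidate by the squared maximum mean discrepancy (MMD) between the candidate and the already-observed property vectors, so the candidate lying farthest outside the observed trend is chosen next. The function names, the RBF kernel, and its bandwidth are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # Gaussian (RBF) kernel between two property vectors
    return np.exp(-gamma * np.sum((a - b) ** 2))

def discrepancy_score(candidate, observed, gamma=0.5):
    # Squared MMD between a point mass at `candidate` and the empirical
    # distribution of `observed`:
    #   k(x, x) - 2 * mean_i k(x, p_i) + mean_{i,j} k(p_i, p_j)
    self_term = rbf_kernel(candidate, candidate, gamma)
    cross = np.mean([rbf_kernel(candidate, p, gamma) for p in observed])
    within = np.mean([[rbf_kernel(p, q, gamma) for q in observed]
                      for p in observed])
    return self_term - 2.0 * cross + within

def pick_next(candidates, observed, gamma=0.5):
    # BLOX-style choice: compute next the candidate that departs most
    # from the property distribution observed so far
    scores = [discrepancy_score(c, observed, gamma) for c in candidates]
    return int(np.argmax(scores))
```

In an exploration loop, the picked candidate's true properties (e.g. from a DFT calculation) would be appended to `observed` before the next round.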
- Published
- 2020
4. Efficient query autocompletion with edit distance-based error tolerance
- Author
- Chuan Xiao, Kunihiko Sadakane, Jie Zhang, Jianbin Qin, Koji Tsuda, Sheng Hu, Yoshiharu Ishikawa, and Wei Wang
- Subjects
Theoretical computer science, Computer science, Query string, Prefix, Trie, Edit distance, Hardware and Architecture, Information Systems
- Abstract
Query autocompletion is an important feature saving users many keystrokes from typing the entire query. In this paper, we study the problem of query autocompletion that tolerates errors in users’ input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit distances from the query string are within the given threshold. The major inherent drawback of these approaches is that the number of such prefixes is huge for the first few characters of the query string and is exponential in the alphabet size. This results in slow query response even if the entire query approximately matches only a few prefixes. We propose a novel neighborhood generation-based method to process error-tolerant query autocompletion. Our proposed method only maintains a small set of active nodes, thus saving both space and time to process the query. We also study efficient duplicate removal, a core problem in fetching query answers, and extend our method to support top-k queries. Optimization techniques are proposed to reduce the index size. The efficiency of our method is demonstrated through extensive experiments on real datasets.
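The matching condition (some prefix of a data string lies within edit distance τ of the typed query) can be stated as a brute-force reference implementation; the paper's contribution is evaluating this condition efficiently with a trie and a small active-node set rather than the per-string dynamic programming below. Function names are illustrative.

```python
def matches_with_errors(query, s, tau):
    """True if the edit distance between `query` and some prefix of `s`
    is at most `tau` (the error-tolerant autocompletion condition)."""
    m, n = len(query), len(s)
    prev = list(range(n + 1))                 # ed("", s[:j]) = j
    for i in range(1, m + 1):
        cur = [i] + [0] * n                   # ed(query[:i], "") = i
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match / substitution
        prev = cur
    return min(prev) <= tau                   # best over all prefixes of s

def autocomplete(query, dictionary, tau=1):
    # Brute-force scan of the dictionary; an index replaces this scan
    return [s for s in dictionary if matches_with_errors(query, s, tau)]
```

For example, the query "lindan" with τ=1 still retrieves "linden tree", because the prefix "linden" is one substitution away.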
- Published
- 2019
5. Molecular generation by Fast Assembly of (Deep)SMILES fragments
- Author
- Francois Berenger and Koji Tsuda
- Subjects
Computer science, Molecular generation, Molecular fragments, SMILES, DeepSMILES, String operations, Multi-core processor, Benchmarking, Chemistry, Software, Computer Graphics and Computer-Aided Design, Computer Science Applications, Library and Information Sciences, Physical and Theoretical Chemistry
- Abstract
Background: In recent years, in silico molecular design is regaining interest. To generate molecules with optimized properties on a computer, scoring functions can be coupled with a molecular generator to design novel molecules with a desired property profile.
Results: In this article, a simple method is described to generate only valid molecules at high frequency (>300,000 molecules/s using a single CPU core), given a molecular training set. The proposed method generates diverse SMILES (or DeepSMILES) encoded molecules while also showing some propensity at training set distribution matching. When working with DeepSMILES, the method reaches peak performance (>340,000 molecules/s) because it relies almost exclusively on string operations. The “Fast Assembly of SMILES Fragments” software is released as open-source at https://github.com/UnixJunkie/FASMIFRA. Experiments regarding speed, training set distribution matching, molecular diversity and benchmarking against several other methods are also shown.
- Published
- 2021
6. Machine-learning-guided Protein Design
- Author
- Mitsuo Umetsu, Tomoshi Kameda, Koji Tsuda, and Yutaka Saito
- Subjects
Computer science, Protein design, Machine learning, Artificial intelligence
- Published
- 2021
7. Machine-learning-guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration
- Author
- Misaki Oikawa, Mitsuo Umetsu, T. J. Sato, Tomoshi Kameda, Koji Tsuda, Hikaru Nakazawa, Tomoyuki Ito, and Yutaka Saito
- Subjects
Computer science, Library design, Training set, Protein engineering, Directed evolution, Sequence space (evolution), Computational biology, Machine learning, Catalysis, General Chemistry
- Abstract
Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of training data. Here, we present a ML-guided directed evolution study of an enzyme to investigate the effects of a known “highly positive” variant (i.e., variant known to have high enzyme activity) in the training data. We performed two separate series of ML-guided directed evolution of Sortase A with and without a known highly positive variant called 5M in the training data. In each series, two rounds of ML were conducted: variants predicted by the first round were experimentally evaluated, and used as additional training data for the second-round prediction. The improvements in enzyme activity were comparable between the two series, both achieving enzyme activity 2.2–2.5 times higher than 5M. Intriguingly, the sequences of the improved variants were largely different between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence/absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data, but also by ML using a subset of the training data even when it lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.
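The two-round cycle described above (train a model, predict promising variants, assay them, retrain on the enlarged data) can be sketched minimally. This is a toy, not the study's pipeline: it assumes a 4-letter alphabet, an additive hidden "activity" standing in for the wet-lab assay, and plain least-squares regression on one-hot features as the model.

```python
import itertools
import numpy as np

ALPHABET = "ACDE"   # toy alphabet; real work uses all 20 amino acids
LENGTH = 4

def one_hot(seq):
    x = np.zeros(LENGTH * len(ALPHABET))
    for i, aa in enumerate(seq):
        x[i * len(ALPHABET) + ALPHABET.index(aa)] = 1.0
    return x

def oracle(seq):
    # Hidden "enzyme activity" (stand-in for the experimental assay):
    # additive per-position contributions; the best variant is "EEEE"
    table = {"A": 0.0, "C": 1.0, "D": 2.0, "E": 3.0}
    return float(sum(table[aa] for aa in seq))

def fit(seqs, ys):
    # Least-squares regression on one-hot features (the "ML model")
    X = np.array([one_hot(s) for s in seqs])
    w, *_ = np.linalg.lstsq(X, np.array(ys), rcond=None)
    return w

def propose(w, seen, k=3):
    # Rank every unseen variant by predicted activity, return the top k
    pool = ("".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH))
    scored = [(one_hot(s) @ w, s) for s in pool if s not in seen]
    return [s for _, s in sorted(scored, reverse=True)[:k]]

# Training-set composition: wild type plus all single mutants, a
# deliberate choice that makes the additive model identifiable
wild = "AAAA"
train = [wild] + [wild[:i] + aa + wild[i + 1:]
                  for i in range(LENGTH) for aa in "CDE"]
ys = [oracle(s) for s in train]

# Two ML-guided rounds: predict, "experimentally" evaluate, retrain
for _ in range(2):
    w = fit(train, ys)
    for s in propose(w, set(train)):
        train.append(s)
        ys.append(oracle(s))
```

With this training composition the model recovers the additive landscape exactly, so the first round already proposes the top variant; changing the starting set steers the search to different regions, the effect the study investigates.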
- Published
- 2021
8. Determination of quasi-primary odors by endpoint detection
- Author
- Koji Tsuda, Kota Shiba, Genki Yoshikawa, Makito Nakatsu, Kosuke Minami, Koki Kitai, Ryo Tamura, and Hanxiao Xu
- Subjects
Multidisciplinary, Computer science, Pattern recognition, Artificial intelligence, Odor, Chemical engineering, Applied physics, Techniques and instrumentation, Medicine, Science
- Abstract
It is known that there are no primary odors whose combinations can represent all other odors. Here, we propose an alternative approach: “quasi” primary odors. This approach comprises the following condition and method: (1) within a collected dataset and (2) by the machine learning-based endpoint detection. The quasi-primary odors are selected from the odors included in a collected odor dataset according to the endpoint score. While it is limited within the given dataset, the combination of such quasi-primary odors with certain ratios can reproduce any other odor in the dataset. To visually demonstrate this approach, the three quasi-primary odors with the three highest endpoint scores are assigned to the vertices of a chromaticity triangle with red, green, and blue. Then, the other odors in the dataset are projected onto the chromaticity triangle to have their unique colors. The number of quasi-primary odors is not limited to three but can be set to an arbitrary number. With this approach, one can first find “extreme” odors (i.e., quasi-primary odors) in a given odor dataset, and then, reproduce any other odor in the dataset or even synthesize a new arbitrary odor by combining such quasi-primary odors with certain ratios.
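The chromaticity-triangle visualization reduces to barycentric interpolation: an odor expressed as mixing ratios of the three quasi-primary odors receives the correspondingly weighted blend of the red, green, and blue vertex colors. A minimal sketch, with illustrative vertex assignment and function name:

```python
import numpy as np

# Colors assigned to the three highest-scoring quasi-primary odors
VERTICES = np.array([[255, 0, 0],    # quasi-primary odor 1 -> red
                     [0, 255, 0],    # quasi-primary odor 2 -> green
                     [0, 0, 255]])   # quasi-primary odor 3 -> blue

def odor_color(ratios):
    """Map an odor, given as mixing ratios of the three quasi-primary
    odors, to a color by barycentric interpolation in the triangle."""
    w = np.asarray(ratios, dtype=float)
    w = w / w.sum()                  # normalize the ratios to sum to 1
    return tuple(int(round(c)) for c in w @ VERTICES)
```

A pure quasi-primary odor maps to its vertex color, and an equal mix of all three lands at the gray center of the triangle.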
- Published
- 2021
9. Enhancing Biomolecular Sampling with Reinforcement Learning: A Tree Search Molecular Dynamics Simulation Method
- Author
- Akio Kitao, Koji Tsuda, Kei Terayama, Kazuhiro Takemura, Duy Phuoc Tran, and Kento Shin
- Subjects
Computer science, General Chemical Engineering, General Chemistry, Molecular dynamics, Sampling, Tree search, Tree traversal, Reinforcement learning, Protein folding
- Abstract
This paper proposes a novel molecular simulation method, called tree search molecular dynamics (TS-MD), to accelerate the sampling of conformational transition pathways, which require considerable computation. In TS-MD, a tree search algorithm, called upper confidence bounds for trees, which is a type of reinforcement learning algorithm, is applied to sample the transition pathway. By learning from the results of the previous simulations, TS-MD efficiently searches conformational space and avoids being trapped in local stable structures. TS-MD exhibits better performance than parallel cascade selection molecular dynamics, which is one of the state-of-the-art methods, for the folding of miniproteins, Chignolin and Trp-cage, in explicit water.
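The "upper confidence bounds for trees" (UCT) selection rule at the heart of such a tree search balances exploitation (mean reward so far) against exploration (a visit-count bonus). The following is a generic sketch of the rule, not TS-MD's actual implementation; the exploration constant `c` is a hypothetical choice.

```python
import math

def uct_select(children, c=1.4):
    """Select a child node by the UCT score
    mean_reward + c * sqrt(ln(parent_visits) / child_visits).
    `children` is a list of (total_reward, visit_count) pairs;
    unvisited children are always explored first."""
    parent_visits = sum(v for _, v in children)
    best_idx, best_score = 0, -float("inf")
    for idx, (total, visits) in enumerate(children):
        if visits == 0:
            return idx                   # explore unvisited moves first
        score = total / visits + c * math.sqrt(
            math.log(parent_visits) / visits)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx
```

Because the bonus shrinks as a child accumulates visits, the search keeps revisiting under-sampled branches instead of getting trapped in one locally good pathway, which is the behavior the abstract attributes to TS-MD.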
- Published
- 2019
10. Application of Bayesian Optimization for Pharmaceutical Product Development
- Author
- Tadashi Kadowaki, Susumu Kimura, Syusuke Sano, and Koji Tsuda
- Subjects
Computer science, Mathematical optimization, Global optimization, Bayesian optimization, Hyperparameter, Design of experiments, Random search, New product development, Pharmaceutical Science, Drug Discovery
- Abstract
Bayesian optimization has been studied in many fields as a technique for global optimization of black-box functions. We applied these techniques for optimizing the formulation and manufacturing methods of pharmaceutical products to eliminate unnecessary experiments and accelerate method development tasks. A simulation dataset was generated by data augmentation from a design of experiment (DoE) which was executed to optimize the formulation and process parameters of orally disintegrating tablets. We defined a composite score for integrating multiple objective functions, physical properties of tablets, to meet the pharmaceutical criteria simultaneously. Performance measurements were used to compare the influence of the selection of initial training sets, by controlling data size and variation, acquisition functions, and schedules of hyperparameter tuning. Additionally, we investigated performance improvements obtained using Bayesian optimization techniques as opposed to a random search strategy. Bayesian optimization efficiently reduces the number of experiments required to obtain the optimal formulation and process parameters from about 25 (with DoE) to 10. Repeated hyperparameter tuning during the Bayesian optimization process stabilizes variations in performance among different optimization conditions, thus improving average performance. We demonstrated the elimination of unnecessary experiments using Bayesian optimization. Simulations of different conditions depicted their dependencies, which will be useful in many real-world applications. Bayesian optimization is expected to reduce the reliance on individual skills and experiences, increasing the efficiency and efficacy of optimization tasks, expediting formulation and manufacturing research in pharmaceutical development.
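A typical acquisition function in Bayesian optimization is expected improvement, which trades off a candidate's surrogate-posterior mean against its uncertainty. The abstract compares acquisition functions without naming them, so the sketch below is generic, not necessarily one the study used; the margin `xi` is a hypothetical parameter.

```python
import math

def normal_pdf(z):
    # Standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement (for maximization) of a candidate whose
    surrogate posterior is N(mu, sigma^2), over the incumbent `best`;
    `xi` is a small exploration margin."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)   # no uncertainty left
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * normal_cdf(z) + sigma * normal_pdf(z)
```

Each iteration fits the surrogate (commonly a Gaussian process) to the experiments so far, evaluates this score over candidate formulations, and runs the experiment with the highest score, which is how the loop cuts roughly 25 DoE experiments down to about 10.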
- Published
- 2019
11. Using molecular dynamics simulations to prioritize and understand AI-generated cell penetrating peptides
- Author
- Yoshihiro Ito, Koji Tsuda, Akiko Yumoto, Akio Kitao, Seiichi Tada, Duy Phuoc Tran, and Takanori Uzawa
- Subjects
Computer science, Computational biology, Cell-Penetrating Peptides, Peptide, Molecular Dynamics Simulation, Membrane biophysics, Computational biophysics, Artificial Intelligence, Machine learning, Statistical inference, Drug delivery, Permeation and transport, Amino Acid Sequence, Cell Membrane, Cell Survival, Reproducibility of Results, Humans, HeLa Cells, Multidisciplinary, Medicine
- Abstract
Cell-penetrating peptides have important therapeutic applications in drug delivery, but the variety of known cell-penetrating peptides is still limited. With a promise to accelerate peptide development, artificial intelligence (AI) techniques including deep generative models are currently in the spotlight. Scientists, however, are often overwhelmed by an excessive number of unannotated sequences generated by AI and find it difficult to obtain insights to prioritize them for experimental validation. To avoid this pitfall, we leverage molecular dynamics (MD) simulations to obtain mechanistic information to prioritize and understand AI-generated peptides. A mechanistic score of permeability is computed from five steered MD simulations starting from different initial structures predicted by homology modelling. To compensate for variability of predicted structures, the score is computed with sample variance penalization so that a peptide with consistent behaviour is highly evaluated. Our computational pipeline involving deep learning, homology modelling, MD simulations and synthesizability assessment generated 24 novel peptide sequences. The top-scoring peptide showed a consistent pattern of conformational change in all simulations regardless of initial structures. In wet-lab experiments, our peptide showed better permeability and weaker toxicity in comparison to a clinically used peptide, TAT. Our result demonstrates how MD simulations can support de novo peptide design by providing mechanistic information supplementing statistical inference.
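Sample variance penalization of the kind described can be sketched as the mean score across repeated MD runs minus a multiple of the sample standard deviation, so a peptide that behaves consistently from different initial structures outranks an erratic one with the same mean. This is a simplified illustration of the idea, not the paper's exact estimator; the weight `lam` is a hypothetical parameter.

```python
import math

def penalized_score(samples, lam=1.0):
    """Variance-penalized score over repeated runs: mean performance
    minus `lam` times the sample standard deviation, rewarding
    consistency across the different initial structures."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    return mean - lam * math.sqrt(var)
```

Two peptides with the same mean permeability score thus separate cleanly once one of them varies strongly between runs.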
- Published
- 2021
12. CompRet: a comprehensive recommendation framework for chemical synthesis planning with algorithmic enumeration
- Author
- Koji Tsuda, Kei Terayama, Kiyosei Takasu, Yasushi Okuno, Ryosuke Shibukawa, Kunihiro Wasa, Shoichi Ishida, and Kazuki Yoshizoe
- Subjects
Theoretical computer science, Computer science, Computer-assisted synthesis planning (CASP), Retrosynthetic analysis, Chemical synthesis, Enumeration algorithm, Exact algorithm, Search algorithm, Mathematical proofs, Machine learning, Computer Graphics and Computer-Aided Design, Computer Science Applications, Physical and Theoretical Chemistry, Library and Information Sciences
- Abstract
In computer-assisted synthesis planning (CASP) programs, providing as many chemical synthetic routes as possible is essential for considering optimal and alternative routes in a chemical reaction network. As the majority of CASP programs have been designed to provide one or a few optimal routes, it is likely that the desired one will not be included. To avoid this, an exact algorithm that lists possible synthetic routes within the chemical reaction network is required, alongside a recommendation of synthetic routes that meet specified criteria based on the chemist’s objectives. Herein, we propose a chemical-reaction-network-based synthetic route recommendation framework called “CompRet” with a mathematically guaranteed enumeration algorithm. In a preliminary experiment, CompRet was shown to successfully provide alternative routes for a known antihistaminic drug, cetirizine. CompRet is expected to promote desirable enumeration-based chemical synthesis searches and aid the development of an interactive CASP framework for chemists.
- Published
- 2020
13. Exploring Successful Parameter Region for Coarse-Grained Simulation of Biomolecules by Bayesian Optimization and Active Learning
- Author
- Ryo Kanada, Kei Terayama, Koji Tsuda, Yasushi Okuno, and Atsushi Tokuhisa
- Subjects
Computer science, Coarse-grained molecular dynamics simulation, Molecular Dynamics Simulation, Bayesian optimization, Active learning, Machine Learning, Protein Folding, Proton-Translocating ATPases, Biological rotary motor, Brute-force search, Robustness, Sensitivity, Structural biology, Biochemistry, Molecular Biology
- Abstract
With the number of resolved biomolecular structures increasing owing to advances in structural biology, the molecular dynamics (MD) approach, especially coarse-grained (CG) MD suitable for macromolecules, is becoming increasingly important for elucidating their dynamics and behavior. In fact, CG-MD simulation has succeeded in qualitatively reproducing numerous biological processes for various biomolecules such as conformational changes and protein folding with reasonable calculation costs. However, CG-MD simulations strongly depend on various parameters, and selecting an appropriate parameter set is necessary to reproduce a particular biological process. Because exhaustive examination of all candidate parameters is inefficient, it is important to identify successful parameters. Furthermore, the successful region, in which the desired process is reproducible, is essential for describing the detailed mechanics of functional processes and environmental sensitivity and robustness. We propose an efficient search method for identifying the successful region by using two machine learning techniques, Bayesian optimization and active learning. We evaluated its performance using F1-ATPase, a biological rotary motor, with CG-MD simulations. We successfully identified the successful region with lower computational costs (12.3% in the best case) without sacrificing accuracy compared to exhaustive search. This method can accelerate not only parameter search but also biological discussion of the detailed mechanics of functional processes and environmental sensitivity based on MD simulation studies.
- Published
- 2020
14. Computer Vision-Based Approach for Quantifying Occupational Therapists’ Qualitative Evaluations of Postural Control
- Author
- Naoto Ienaga, Hiroyuki Ishihara, Haruka Noda, Hiromichi Hagihara, Daiki Enomoto, Koji Tsuda, Shuhei Takahata, and Kei Terayama
- Subjects
Occupational therapy, Occupational Therapists, Computer science, Computer vision, Postural control, Postural Balance, Motor Skills, Child Development, Preschool children, Quantitative evaluations, Telemedicine, General Medicine
- Abstract
This study aimed to leverage computer vision (CV) technology to develop a technique for quantifying postural control. A conventional quantitative index, occupational therapists’ qualitative clinical evaluations, and CV-based quantitative indices using an image analysis algorithm were applied to evaluate the postural control of 34 typically developed preschoolers. The effectiveness of the CV-based indices was investigated relative to current methods to explore the clinical applicability of the proposed method. The capacity of the CV-based indices to reflect therapists’ qualitative evaluations was confirmed. Furthermore, compared to the conventional quantitative index, the CV-based indices provided more detailed quantitative information with lower costs. CV-based evaluations enable therapists to quantify details of motor performance that are currently observed qualitatively. The development of such precise quantification methods will improve the science and practice of occupational therapy and allow therapists to perform to their full potential.
- Published
- 2020
15. Fine-grained optimization method for crystal structure prediction
- Author
- Koji Tsuda, Tamio Oguchi, Tomoki Yamashita, and Kei Terayama
- Subjects
Computer science, Crystal structure prediction, Crystal structure, Random search, Quadratic approximation, Look-ahead, Relaxation (approximation), General Materials Science, Mechanics of Materials, Modeling and Simulation, Computer Science Applications
- Abstract
Crystal structure prediction based on first-principles calculations is often achieved by applying relaxation to randomly generated initial structures. Relaxing a structure requires multiple optimization steps. It is time consuming to fully relax all the initial structures, but it is difficult to figure out which initial structure leads to the optimal solution in advance. In this paper, we propose an optimization method for crystal structure prediction, called Look Ahead based on Quadratic Approximation, that optimally assigns optimization steps to each candidate structure. It allows us to identify the most stable structure with a minimum number of total local optimization steps. Our simulations using the known systems Si, NaCl, Y2Co17, Al2O3, and GaAs showed that the computational cost can be reduced significantly compared to random search. This method can be applied for controlling all kinds of local optimizations based on first-principles calculations to obtain best results under restricted computational resources. Speeding up the prediction of atomic crystal structures is fundamental to predicting a new material’s physical properties. A team led by Kei Terayama and Koji Tsuda from the University of Tokyo devised a new and accelerated optimization method for crystal structure prediction where a large number of candidate atomic structures are generated, scored according to their lowest energies, and finally the local optimization of the structures with the lowest score is prioritized. They found that the total number of steps necessary to obtain the crystal structures of seven known systems, depending on the system, can be reduced by more than twenty times compared to random searching methods. This new approach to crystal structure prediction based on controlling local optimization steps may also help us, for example, identify new molecules.
- Published
- 2018
16. Machine learning accelerates MD-based binding pose prediction between ligands and proteins
- Author
- Mitsugu Araki, Kei Terayama, Koji Tsuda, Yasushi Okuno, and Hiroaki Iwata
- Subjects
Statistics and Probability, Computer science, Molecular Dynamics Simulation, Ligands, Protein Conformation, Binding free energy, Docking, Pose prediction, Drug discovery, Machine Learning, Computational Biology, Structural Bioinformatics, Protein Binding, Biochemistry, Molecular Biology, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics
- Abstract
Motivation: Fast and accurate prediction of protein–ligand binding structures is indispensable for structure-based drug design and accurate estimation of the binding free energy of drug candidate molecules in drug discovery. Recently, accurate pose prediction methods based on short Molecular Dynamics (MD) simulations, such as MM-PBSA and MM-GBSA, applied to generated docking poses have come into use. Since molecular structures obtained from MD simulation depend on the initial condition, taking the average over different initial conditions leads to better accuracy, and prediction accuracy of protein–ligand binding poses can be improved with multiple runs at different initial velocities.
Results: This paper shows that a machine learning method, called Best Arm Identification, can optimally control the number of MD runs for each binding pose. It allows us to identify a correct binding pose with a minimum number of total runs. Our experiment using three proteins and eight inhibitors showed that the computational cost can be reduced substantially without sacrificing accuracy. This method can be applied to control all kinds of molecular simulations to obtain the best results under restricted computational resources.
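Best Arm Identification comes from the multi-armed bandit literature: each candidate binding pose is an "arm", one MD run is one "pull", and the algorithm adaptively allocates runs toward promising poses instead of sampling all poses equally. The following is a generic UCB-style sketch of the idea, not the paper's exact algorithm; names and the confidence-bound constant are illustrative.

```python
import math

def best_arm_ucb(pull, n_arms, budget):
    """UCB-style best-arm identification: `pull(i)` runs one (noisy)
    simulation of arm (pose) i and returns its reward; after `budget`
    total pulls the arm with the best empirical mean is returned."""
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    for i in range(n_arms):                # pull every arm once
        totals[i] += pull(i)
        counts[i] += 1
    for t in range(n_arms, budget):        # adaptive allocation
        ucb = [totals[i] / counts[i]
               + math.sqrt(2.0 * math.log(t + 1) / counts[i])
               for i in range(n_arms)]
        i = max(range(n_arms), key=lambda k: ucb[k])
        totals[i] += pull(i)
        counts[i] += 1
    return max(range(n_arms), key=lambda i: totals[i] / counts[i])
```

Because clearly inferior poses stop attracting pulls once their confidence bounds fall below the leader's, the total number of MD runs needed to single out the correct pose shrinks, which is the cost reduction the abstract reports.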
- Published
- 2018
17. ChemTS: an efficient python library for de novo molecular generation
- Author
- Kei Terayama, Koji Tsuda, Kazuki Yoshizoe, Jinzhe Zhang, and Xiufeng Yang
- Subjects
Computer science, Materials informatics, Molecular design, Monte Carlo tree search, Recurrent neural network, Artificial neural network, Autoencoders, Chemical space, Python (programming language), Chemical Physics, General Materials Science
- Abstract
Automatic design of organic materials requires black-box optimization in a vast chemical space. In conventional molecular design algorithms, a molecule is built as a combination of predetermined fragments. Recently, deep neural network models such as variational autoencoders and recurrent neural networks (RNNs) have been shown to be effective in de novo design of molecules without any predetermined fragments. This paper presents a novel Python library ChemTS that explores the chemical space by combining Monte Carlo tree search and an RNN. In a benchmarking problem of optimizing the octanol-water partition coefficient and synthesizability, our algorithm showed superior efficiency in finding high-scoring molecules. ChemTS is available at https://github.com/tsudalab/ChemTS.
- Published
- 2017
18. RNA inverse folding using Monte Carlo tree search
- Author
- Koji Tsuda, Kazuki Yoshizoe, Akito Taneda, and Xiufeng Yang
- Subjects
Computer science, RNA, RNA inverse folding, RNA Folding, Monte Carlo tree search, Nucleic acid structure, Pseudoknot, GC-content, Local update, Sequence Analysis (RNA), Structural Biology, Molecular Biology, Applied Mathematics, Biochemistry, Computer Science Applications
- Abstract
Background: Artificially synthesized RNA molecules provide important ways for creating a variety of novel functional molecules. State-of-the-art RNA inverse folding algorithms can design simple and short RNA sequences of specific GC content that fold into the target RNA structure. However, their performance is not satisfactory in complicated cases.
Results: We present a new inverse folding algorithm called MCTS-RNA, which uses Monte Carlo tree search (MCTS), a technique that has recently shown exceptional performance in Computer Go, to represent and discover the essential part of the sequence space. To obtain high accuracy, initial sequences generated by MCTS are further improved by a series of local updates. Our algorithm can control the GC content precisely and can deal with pseudoknot structures. Using common benchmark datasets for evaluation, MCTS-RNA shows considerable promise as a standard method of RNA inverse folding.
Conclusion: MCTS-RNA is available at https://github.com/tsudalab/MCTS-RNA.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1882-7) contains supplementary material, which is available to authorized users.
- Published
- 2017
19. Machine Learning Algorithm for High-Order Interaction Modeling
- Author
- Koji Tsuda, Ichiro Takeuchi, and Kazuya Nakagawa
- Subjects
Computer science, Machine learning, Algorithmic learning theory, Computational learning theory, Learning classifier systems, Active learning, Unsupervised learning, Instance-based learning, Stability (learning theory), Robot learning, Artificial intelligence
- Published
- 2017
20. MP-LAMP: parallel detection of statistically significant multi-loci markers on cloud platforms
- Author
-
Koji Tsuda, Kazuki Yoshizoe, and Aika Terada
- Subjects
0301 basic medicine ,Statistics and Probability ,Computer science ,Cloud computing ,0102 computer and information sciences ,Parallel computing ,01 natural sciences ,Biochemistry ,03 medical and health sciences ,Software ,Computer cluster ,Humans ,Search problem ,Molecular Biology ,Massively parallel ,SIMPLE (military communications protocol) ,business.industry ,Genetics and Population Analysis ,Cloud Computing ,Applications Notes ,Computer Science Applications ,Computational Mathematics ,Tree (data structure) ,ComputingMethodologies_PATTERNRECOGNITION ,030104 developmental biology ,Computational Theory and Mathematics ,010201 computation theory & mathematics ,Work stealing ,business ,Algorithms - Abstract
Summary Exhaustive detection of multi-loci markers from genome-wide association study datasets is a computationally challenging problem. This paper presents a massively parallel algorithm for finding all significant combinations of alleles and introduces a software tool termed MP-LAMP that can be easily deployed in a cloud platform, such as Amazon Web Services, as well as in an in-house computer cluster. Multi-loci marker detection is an unbalanced tree search problem that cannot be parallelized by simple tree-splitting using generic parallel programming frameworks, such as Map-Reduce. We employ work stealing and periodic reduce-broadcast to decrease the running time almost linearly with the number of cores. Availability and implementation MP-LAMP is available at https://github.com/tsudalab/mp-lamp. Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2018
21. Application of Next-Generation Sequencing Analysis in the Directed Evolution for Creating Antibody Mimic
- Author
-
Hikaru Nakazawa, Hafumi Nishi, Thuy Duong Nguyen, Mitsuo Umetsu, Tomoshi Kameda, Tomoyuki Ito, Yutaka Saito, and Koji Tsuda
- Subjects
biology ,Computer science ,Biophysics ,biology.protein ,Computational biology ,Antibody ,Directed evolution ,DNA sequencing - Published
- 2021
22. Sparse modeling of EELS and EDX spectral imaging data by nonnegative matrix factorization
- Author
-
Toshiyuki Mori, Yuta Yamamoto, Shunsuke Muto, Takayoshi Tanji, Koji Tsuda, Kazuyoshi Tatsumi, and Motoki Shiga
- Subjects
010302 applied physics ,medicine.medical_specialty ,Chemical substance ,Computer science ,02 engineering and technology ,Spectral component ,021001 nanoscience & nanotechnology ,01 natural sciences ,Atomic and Molecular Physics, and Optics ,Electronic, Optical and Magnetic Materials ,Matrix decomposition ,Non-negative matrix factorization ,Spectral imaging ,Matrix (mathematics) ,Factorization ,Region of interest ,0103 physical sciences ,medicine ,0210 nano-technology ,Instrumentation ,Algorithm - Abstract
Advances in scanning transmission electron microscopy (STEM) techniques have enabled us to automatically obtain electron energy-loss (EELS)/energy-dispersive X-ray (EDX) spectral datasets from a specified region of interest (ROI) at an arbitrary step width, called spectral imaging (SI). Instead of manually identifying the potential constituent chemical components from the ROI and determining the chemical state of each spectral component from the SI data stored in a huge three-dimensional matrix, it is more effective and efficient to use a statistical approach for the automatic resolution and extraction of the underlying chemical components. Among many different statistical approaches, we adopt a non-negative matrix factorization (NMF) technique, mainly because of the natural assumption of non-negative values in the spectra and cardinalities of chemical components, which are always positive in actual data. This paper proposes a new NMF model with two penalty terms: (i) an automatic relevance determination (ARD) prior, which optimizes the number of components, and (ii) a soft orthogonal constraint, which clearly resolves each spectrum component. For the factorization, we further propose a fast optimization algorithm based on hierarchical alternating least-squares. Numerical experiments using both phantom and real STEM-EDX/EELS SI datasets demonstrate that the ARD prior successfully identifies the correct number of physically meaningful components. The soft orthogonal constraint is also shown to be effective, particularly for STEM-EELS SI data, where neither the spatial nor spectral entries in the matrices are sparse.
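The hierarchical alternating least-squares (HALS) updates that the proposed optimizer builds on can be illustrated compactly. The sketch below is plain HALS NMF for X ≈ WH with nonnegativity only; the paper's actual model additionally carries the ARD prior and the soft orthogonal penalty in these column/row updates, which are omitted here for brevity.

```python
import numpy as np

def nmf_hals(X, k, iters=500, eps=1e-9, seed=0):
    # Factorize a nonnegative matrix X (m x n) as W (m x k) @ H (k x n).
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Update each column of W in turn, holding H fixed.
        HHt, XHt = H @ H.T, X @ H.T
        for j in range(k):
            W[:, j] = np.maximum(
                eps, W[:, j] + (XHt[:, j] - W @ HHt[:, j]) / (HHt[j, j] + eps))
        # Update each row of H in turn, holding W fixed.
        WtW, WtX = W.T @ W, W.T @ X
        for j in range(k):
            H[j, :] = np.maximum(
                eps, H[j, :] + (WtX[j, :] - WtW[j, :] @ H) / (WtW[j, j] + eps))
    return W, H
```

The small `eps` floor keeps every entry strictly positive, which also prevents division by zero when a component momentarily dies out; in the full model, the ARD prior is what drives unneeded components toward zero in a principled way.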
- Published
- 2016
23. COMBO: An efficient Bayesian optimization library for materials science
- Author
-
Zhufeng Hou, Koji Tsuda, Teruyasu Mizoguchi, Trevor David Rhone, and Tsuyoshi Ueno
- Subjects
010302 applied physics ,Hyperparameter ,Training set ,Computer science ,Bayesian optimization ,Scientific discovery ,02 engineering and technology ,Python (programming language) ,021001 nanoscience & nanotechnology ,computer.software_genre ,01 natural sciences ,0103 physical sciences ,General Materials Science ,Data mining ,0210 nano-technology ,Global optimization ,computer ,Thompson sampling ,Information Systems ,Cholesky decomposition ,computer.programming_language - Abstract
In many subfields of chemistry and physics, numerous attempts have been made to accelerate scientific discovery using data-driven experimental design algorithms. Among them, Bayesian optimization has been proven to be an effective tool. A standard implementation (e.g., scikit-learn), however, can accommodate only small training datasets. We designed an efficient protocol for Bayesian optimization that employs Thompson sampling, random feature maps, rank-one Cholesky updates and automatic hyperparameter tuning, and implemented it as an open-source Python library called COMBO (COMmon Bayesian Optimization library). Promising results using COMBO to determine the atomic structure of a crystalline interface are presented. COMBO is available at https://github.com/tsudalab/combo.
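The combination of random feature maps and Thompson sampling can be sketched as follows. This is our own minimal illustration, not COMBO's code: random Fourier features approximate an RBF kernel, Bayesian linear regression on those features yields a weight posterior, and one posterior sample is maximized over the candidate set. COMBO's rank-one Cholesky updates and automatic hyperparameter tuning are omitted, and all names and default values here are assumptions.

```python
import numpy as np

def rff(X, n_feat=100, gamma=5.0, seed=0):
    # Random Fourier features approximating an RBF kernel (Rahimi & Recht).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_feat))
    b = rng.uniform(0, 2 * np.pi, n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)

def thompson_step(Phi_obs, y, Phi_cand, noise=0.1, rng=None):
    # Bayesian linear regression with a N(0, I) weight prior; draw one
    # posterior weight sample and take its argmax over the candidates.
    rng = rng if rng is not None else np.random.default_rng(1)
    A = Phi_obs.T @ Phi_obs / noise**2 + np.eye(Phi_obs.shape[1])
    mean = np.linalg.solve(A, Phi_obs.T @ y) / noise**2
    w = rng.multivariate_normal(mean, np.linalg.inv(A))
    return int(np.argmax(Phi_cand @ w))

def optimize(f, candidates, n_init=3, n_iter=20, seed=2):
    # Black-box maximization of f over a finite candidate pool.
    rng = np.random.default_rng(seed)
    Phi = rff(candidates)
    picked = list(rng.choice(len(candidates), n_init, replace=False))
    for _ in range(n_iter):
        y = np.array([f(candidates[i]) for i in picked])
        picked.append(thompson_step(Phi[picked], y, Phi, rng=rng))
    return candidates[max(picked, key=lambda i: f(candidates[i]))]
```

Working in the finite-dimensional feature space is what keeps each iteration cheap: the posterior update costs depend on the number of features, not on the number of observations, which is the point of using random feature maps in place of an exact Gaussian process.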
- Published
- 2016
24. Can Machine Learning Guide Directed Evolution of Functional Proteins
- Author
-
Misaki Oikawa, Yutaka Saito, Tomoshi Kameda, Mitsuo Umetsu, Koji Tsuda, T. J. Sato, and Hikaru Nakazawa
- Subjects
Human–computer interaction ,Computer science ,Biophysics ,Directed evolution - Published
- 2020
25. evERdock BAI: Machine-learning-guided selection of protein-protein complex structure
- Author
-
Koji Tsuda, Kazuhiro Takemura, Kei Terayama, Akio Kitao, and Ai Shinobu
- Subjects
Protein Conformation ,Computer science ,Complex system ,General Physics and Astronomy ,Molecular Dynamics Simulation ,010402 general chemistry ,Machine learning ,computer.software_genre ,01 natural sciences ,Machine Learning ,0103 physical sciences ,Reinforcement learning ,Physical and Theoretical Chemistry ,Representation (mathematics) ,Selection (genetic algorithm) ,010304 chemical physics ,business.industry ,Proteins ,Relaxation (iterative method) ,0104 chemical sciences ,Identification (information) ,Artificial intelligence ,Decoy ,business ,computer ,Energy (signal processing) ,Protein Binding - Abstract
Computational techniques for accurate and efficient prediction of protein-protein complex structures are widely used for elucidating protein-protein interactions, which play important roles in biological systems. Recently, it has been reported that selecting a structure similar to the native structure among generated structure candidates (decoys) is possible by calculating binding free energies of the decoys based on all-atom molecular dynamics (MD) simulations with explicit solvent and the solution theory in the energy representation, which is called evERdock. A recent version of evERdock achieves a higher-accuracy decoy selection by introducing MD relaxation and multiple MD simulations/energy calculations; however, a huge computational cost is required. In this paper, we propose an efficient decoy selection method using evERdock and the best arm identification (BAI) framework, which is one of the techniques of reinforcement learning. The BAI framework realizes an efficient selection by suppressing calculations for nonpromising decoys and preferentially calculating for the promising ones. We evaluate the performance of the proposed method for decoy selection problems of three protein-protein complex systems. These results show that computational costs are successfully reduced by a factor of 4.05 (in the best case) compared to a standard decoy selection approach without sacrificing accuracy.
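The core BAI idea, spending evaluations preferentially on promising decoys, can be illustrated with successive halving, one classic best arm identification strategy. This is a hedged sketch rather than the specific BAI algorithm of the paper: each "arm" stands for a decoy, and each pull stands for one noisy MD-based free-energy evaluation.

```python
def successive_halving(arms, pulls_per_round=4):
    # `arms` maps an arm id to a zero-argument function returning one noisy
    # evaluation (here, think: one free-energy estimate of one decoy).
    live = list(arms)
    while len(live) > 1:
        # Pull every surviving arm a few times and average.
        means = {a: sum(arms[a]() for _ in range(pulls_per_round)) / pulls_per_round
                 for a in live}
        live.sort(key=lambda a: means[a], reverse=True)
        live = live[:max(1, len(live) // 2)]   # drop the weaker half
    return live[0]
```

Weak arms are eliminated after only a handful of pulls, so the total number of expensive evaluations grows roughly logarithmically with the number of rounds instead of linearly with a fixed per-decoy budget, which is the source of the kind of cost reduction the abstract reports.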
- Published
- 2019
26. DESIGNING NANOSTRUCTURES FOR HEAT TRANSPORT VIA MATERIALS INFORMATICS
- Author
-
Koji Tsuda, Junichiro Shiomi, Thaer M. Dieb, and Shenghong Ju
- Subjects
Nanostructure ,Computer science ,Monte Carlo tree search ,Bayesian optimization ,Materials informatics ,Nanotechnology - Published
- 2018
27. LAMPLINK: detection of statistically significant SNP combinations from GWAS data
- Author
-
Aika Terada, Ryo Yamada, Koji Tsuda, and Jun Sese
- Subjects
0301 basic medicine ,Statistics and Probability ,Computer science ,Genome-wide association study ,Single-nucleotide polymorphism ,Computational biology ,Polymorphism, Single Nucleotide ,Biochemistry ,Set (abstract data type) ,03 medical and health sciences ,0302 clinical medicine ,Missing heritability problem ,Animals ,Humans ,SNP ,1000 Genomes Project ,Molecular Biology ,Genetic association ,Genome ,Genetics and Population Analysis ,Applications Notes ,Computer Science Applications ,Computational Mathematics ,030104 developmental biology ,Computational Theory and Mathematics ,Epistasis ,Software ,030217 neurology & neurosurgery ,Genome-Wide Association Study - Abstract
Summary: One of the major issues in genome-wide association studies is to solve the missing heritability problem. While considering epistatic interactions among multiple SNPs may contribute to solving this problem, existing software cannot detect statistically significant high-order interactions. We propose software named LAMPLINK, which employs a cutting-edge method to enumerate statistically significant SNP combinations from genome-wide case–control data. LAMPLINK is implemented as a set of additional functions to PLINK, and hence existing procedures with PLINK can be applicable. Applied to the 1000 Genomes Project data, LAMPLINK detected a combination of five SNPs that are statistically significantly accumulated in the Japanese population. Availability and Implementation: LAMPLINK is available at http://a-terada.github.io/lamplink/. Contact: terada@cbms.k.u-tokyo.ac.jp or sese.jun@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2016
28. Efficient recommendation tool of materials by an executable file based on machine learning
- Author
-
Ryo Tamura, Kei Terayama, and Koji Tsuda
- Subjects
010302 applied physics ,Physics and Astronomy (miscellaneous) ,Computer science ,Bayesian optimization ,General Engineering ,Materials informatics ,General Physics and Astronomy ,Sampling (statistics) ,computer.file_format ,computer.software_genre ,01 natural sciences ,0103 physical sciences ,Executable ,Ternary phase diagram ,Data mining ,computer - Abstract
To accelerate the discovery of novel materials, an easy-to-use materials informatics tool is essential. We develop materials informatics applications, which can be executed on a Windows computer without any special settings. Our applications efficiently perform Bayesian optimization to optimize materials properties and uncertainty sampling to complete a new phase diagram. We describe the usage of these applications and show sampling results for a ternary phase diagram.
- Published
- 2019
29. Integration of sonar and optical camera images using deep neural network for fish monitoring
- Author
-
Koji Tsuda, Kei Terayama, Katsunori Mizuno, and Kento Shin
- Subjects
0106 biological sciences ,Optical camera ,Artificial neural network ,Computer science ,Sardinops melanostictus ,business.industry ,010604 marine biology & hydrobiology ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,ComputerApplications_COMPUTERSINOTHERSYSTEMS ,04 agricultural and veterinary sciences ,Aquatic Science ,01 natural sciences ,Sonar ,040102 fisheries ,0401 agriculture, forestry, and fisheries ,Fish ,Monochrome ,Computer vision ,Artificial intelligence ,Underwater ,business ,Fish resources
Fish monitoring in aquaculture farms is indispensable for managing the growth and health status of fish resources. However, it is unrealistic to expect humans to be able to perform monitoring at night, when standard optical cameras are generally inapplicable. Although sonar systems can be used at night, their practical applications are limited by their monochrome, low-quality images. In this paper, we describe a realistic image generation system that uses sonar and camera images recorded at night. The proposed approach is based on conditional generative adversarial networks, which learn the image-to-image translation between sonar and optical images. We tested the system in a fish tank containing thousands of sardines (Sardinops melanostictus). Images were simultaneously recorded using high-precision imaging sonar and an underwater camera. Experimental results show that the proposed model successfully generates realistic daytime images from sonar and night camera images. Our system enables nighttime monitoring using sonar and an optical camera, leading to more efficient fish farming and environmental surveillance.
- Published
- 2019
30. An interpretable machine learning model for diagnosis of Alzheimer's disease
- Author
-
Junichi Ito, Diptesh Das, Koji Tsuda, and Tadashi Kadowaki
- Subjects
Bioinformatics ,Computer science ,Data Mining and Machine Learning ,Interpretable model ,Decision tree ,Alzheimer’s disease (AD) ,lcsh:Medicine ,02 engineering and technology ,Disease ,Machine learning ,computer.software_genre ,General Biochemistry, Genetics and Molecular Biology ,03 medical and health sciences ,0302 clinical medicine ,Machine learning model ,Neuroimaging ,ADNI ,0202 electrical engineering, electronic engineering, information engineering ,medicine ,Sparse high-order interaction ,Dementia ,Medical diagnosis ,Computer-aided diagnosis (CAD) model ,Cognitive Disorders ,Classification with rejection option ,Interpretability ,business.industry ,General Neuroscience ,lcsh:R ,Computational Biology ,Cost-effective framework ,General Medicine ,medicine.disease ,Precision medicine ,3. Good health ,SHIMR ,Cohort ,020201 artificial intelligence & image processing ,Artificial intelligence ,General Agricultural and Biological Sciences ,business ,computer ,030217 neurology & neurosurgery ,Neuroscience - Abstract
We present an interpretable machine learning model for medical diagnosis called sparse high-order interaction model with rejection option (SHIMR). A decision tree explains to a patient the diagnosis with a long rule (i.e., conjunction of many intervals), while SHIMR employs a weighted sum of short rules. Using proteomics data of 151 subjects in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, SHIMR is shown to be as accurate as other non-interpretable methods (Sensitivity, SN = 0.84 ± 0.1, Specificity, SP = 0.69 ± 0.15 and Area Under the Curve, AUC = 0.86 ± 0.09). For clinical usage, SHIMR has a function to abstain from making any diagnosis when it is not confident enough, so that a medical doctor can choose more accurate but invasive and/or more costly pathologies. The incorporation of a rejection option complements SHIMR in designing a multistage cost-effective diagnosis framework. Using a baseline concentration of cerebrospinal fluid (CSF) and plasma proteins from a common cohort of 141 subjects, SHIMR is shown to be effective in designing a patient-specific cost-effective Alzheimer’s disease (AD) pathology. Thus, interpretability, reliability and having the potential to design a patient-specific multistage cost-effective diagnosis framework can make SHIMR serve as an indispensable tool in the era of precision medicine that can cater to the demand of both doctors and patients, and reduce the overwhelming financial burden of medical diagnosis.
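The "weighted sum of short rules with an option to abstain" can be sketched in a few lines. This illustrates only the scoring and rejection mechanism: the rules, weights, marker names and margin below are hypothetical, and learning the sparse rule weights (the hard part of SHIMR) is not shown.

```python
def predict_with_rejection(x, rules, bias=0.0, reject_margin=0.5):
    # rules: list of (weight, predicate) pairs; each predicate is a short
    # rule, e.g. a single interval test on one measurement.
    score = bias + sum(w for w, rule in rules if rule(x))
    if abs(score) < reject_margin:
        return None    # abstain: defer to a costlier but more accurate test
    return 1 if score > 0 else -1

# Hypothetical two-rule model on made-up biomarker names.
rules = [(1.0, lambda x: x["marker_a"] > 2.0),
         (-0.8, lambda x: x["marker_b"] < 1.0)]
```

With these made-up rules, a sample firing only the first rule scores 1.0 (a confident positive), while a sample firing both rules scores 0.2 and is rejected, exactly the case the multistage framework hands to a more invasive test.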
- Published
- 2019
31. Efficient error-tolerant query autocompletion
- Author
-
Kunihiko Sadakane, Wei Wang, Jianbin Qin, Yoshiharu Ishikawa, Koji Tsuda, and Chuan Xiao
- Subjects
Web search query ,Theoretical computer science ,Computer science ,General Engineering ,Online aggregation ,Range query (database) ,Query optimization ,computer.software_genre ,Query language ,Ranking (information retrieval) ,Query expansion ,Web query classification ,Trie ,Query by Example ,Edit distance ,Sargable ,Data mining ,computer ,Boolean conjunctive query ,computer.programming_language - Abstract
Query autocompletion is an important feature that spares users from typing the entire query. In this paper we study the problem of query autocompletion that tolerates errors in users' input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit distance from the query is within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query response even if the entire query approximately matches only a few prefixes. In this paper, we propose a novel neighborhood generation-based algorithm, IncNGTrie, which can achieve up to two orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our proposed algorithm only maintains a small set of active nodes, thus saving both space and time to process the query. We also study efficient duplicate removal, which is a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size, and discuss several extensions to our method. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.
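The problem being solved, returning every dictionary word that has some prefix within edit distance τ of the typed query, can be pinned down with a brute-force reference implementation. This sketch only restates the semantics; IncNGTrie's contribution is avoiding exactly this per-word dynamic program by precomputing deletion neighborhoods and maintaining a small active-node set.

```python
def autocomplete(query, words, tau=1):
    # Return every word having some prefix within edit distance tau of the
    # query.  One DP row per query character: after the loops, dp[j] holds
    # the edit distance between the full query and w[:j].
    res = []
    for w in words:
        dp = list(range(len(w) + 1))
        for i, cq in enumerate(query, 1):
            prev, dp[0] = dp[0], i
            for j, cw in enumerate(w, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                         dp[j - 1] + 1,  # insertion
                                         prev + (cq != cw))  # (mis)match
        if min(dp) <= tau:
            res.append(w)
    return res
```

Taking the minimum over all `dp[j]` is what turns plain edit distance into prefix-tolerant matching: the query only needs to be close to some prefix of the word, not to the whole word.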
- Published
- 2013
32. Fast Iterative Mining Using Sparsity-Inducing Loss Functions
- Author
-
Hiroto Saigo, Hisashi Kashima, and Koji Tsuda
- Subjects
Discriminative pattern mining ,Computer science ,business.industry ,Pattern recognition ,Machine learning ,computer.software_genre ,Regression ,Artificial Intelligence ,Hardware and Architecture ,Computer Vision and Pattern Recognition ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,computer ,Software - Published
- 2013
33. Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining
- Author
-
Kazuya Nakagawa, Ichiro Takeuchi, Shinya Suzumura, Masayuki Karasuyama, and Koji Tsuda
- Subjects
FOS: Computer and information sciences ,Class (computer programming) ,business.industry ,Property (programming) ,Computer science ,Node (networking) ,InformationSystems_DATABASEMANAGEMENT ,Machine Learning (stat.ML) ,02 engineering and technology ,010501 environmental sciences ,Machine learning ,computer.software_genre ,01 natural sciences ,Tree (data structure) ,Statistics - Machine Learning ,Convex optimization ,0202 electrical engineering, electronic engineering, information engineering ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Artificial intelligence ,Data mining ,Pruning (decision trees) ,business ,computer ,0105 earth and related environmental sciences - Abstract
In this paper, we study predictive pattern mining problems where the goal is to construct a predictive model based on a subset of predictive patterns in the database. Our main contribution is to introduce a novel method called safe pattern pruning (SPP) for a class of predictive pattern mining problems. The SPP method allows us to efficiently find a superset of all the predictive patterns in the database that are needed for the optimal predictive model. The advantage of the SPP method over existing boosting-type methods is that the former can find the superset by a single search over the database, while the latter requires multiple searches. The SPP method is inspired by recent development of safe feature screening. In order to extend the idea of safe feature screening into predictive pattern mining, we derive a novel pruning rule, the safe pattern pruning (SPP) rule, that can be used for searching over the tree defined among patterns in the database. The SPP rule has the property that, if a node corresponding to a pattern in the database is pruned out by the SPP rule, then it is guaranteed that all the patterns corresponding to its descendant nodes are never needed for the optimal predictive model. We apply the SPP method to graph mining and item-set mining problems, and demonstrate its computational advantage.
- Published
- 2016
34. Significant Pattern Mining with Confounding Variables
- Author
-
Aika Terada, David duVerle, and Koji Tsuda
- Subjects
Computer science ,Confounding ,Word error rate ,02 engineering and technology ,Logistic regression ,01 natural sciences ,Synthetic data ,010104 statistics & probability ,Outcome variable ,020204 information systems ,Statistical significance ,Multiple comparisons problem ,Statistics ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,Testability - Abstract
Recent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are adjusted to control the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications, due to their inability to handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can exclude a large number of combinations without testing them, limiting the amount of correction required for multiple testing. Using synthetic data, we showed that our method could remove the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we demonstrated an application to data integration using a confounding variable.
- Published
- 2016
35. Privacy-preserving search for chemical compound databases
- Author
-
Kiyoshi Asai, Hiromi Arai, Shigeo Mitsunari, Nuttapong Attrapadung, Takatsugu Hirokawa, Kana Shimizu, Michiaki Hamada, Koji Tsuda, Goichiro Hanaoka, Koji Nuida, and Jun Sakuma
- Subjects
Similarity Search ,Computer science ,Nearest neighbor search ,Cryptography ,Encryption ,computer.software_genre ,Biochemistry ,Text mining ,Structural Biology ,Tversky Index ,Cryptosystem ,Molecular Biology ,Computer Security ,Database server ,Web search query ,Information retrieval ,Additive Homomorphic Cryptosystem ,Database ,business.industry ,Research ,Applied Mathematics ,Cloud Computing ,Cryptographic protocol ,Privacy Preserving Data Mining ,Computer Science Applications ,Information sensitivity ,Chemical Compound ,Data mining ,business ,computer ,Algorithms ,Databases, Chemical - Abstract
Background Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources. Results In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder's privacy and database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation. Conclusion We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.
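The paper's only cryptographic building block, an additive-homomorphic cryptosystem, lets the server add values it cannot read: Enc(m1) * Enc(m2) mod n² decrypts to m1 + m2. The sketch below is a textbook Paillier scheme with toy-sized primes, chosen purely to make the homomorphic property concrete; the abstract does not state which additive-homomorphic scheme the protocol uses, and real deployments need cryptographically sized keys.

```python
import math
import random

def paillier_keygen(p=47, q=59):
    # Toy-sized primes for illustration only; real keys use ~2048-bit moduli.
    n = p * q
    lam = math.lcm(p - 1, q - 1)            # Python >= 3.9
    g = n + 1
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n * n)), -1, n)  # modular inverse (Python >= 3.8)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    # Fresh randomness r makes encryptions of equal plaintexts differ.
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    L = lambda u: (u - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n
```

Multiplying two ciphertexts modulo n² adds the plaintexts, which is exactly the primitive that lets a database server accumulate a similarity score (e.g., the terms of a Tversky index) over encrypted fingerprints without ever seeing the query.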
- Published
- 2015
36. PDB-scale analysis of known and putative ligand-binding sites with structural sketches
- Author
-
Kentaro Tomii, Koji Tsuda, Jun Ichi Ito, Yasuo Tabei, and Kana Shimizu
- Subjects
Models, Molecular ,Proteomics ,Binding Sites ,Protein family ,Computer science ,Nearest neighbor search ,Protein Data Bank (RCSB PDB) ,Proteins ,Computational biology ,Ligands ,Biochemistry ,Structural genomics ,Structural bioinformatics ,Structural Biology ,Pairwise comparison ,Binding site ,Databases, Protein ,Molecular Biology ,Time complexity ,Algorithm ,Algorithms - Abstract
Computational investigation of protein functions is one of the most urgent and demanding tasks in the field of structural bioinformatics. Exhaustive pairwise comparison of known and putative ligand-binding sites, across protein families and folds, is essential in elucidating the biological functions and evolutionary relationships of proteins. Given the vast amounts of data available now, existing 3D structural comparison methods are not adequate due to their computational time complexity. In this article, we propose a new bit string representation of binding sites called structural sketches, which is obtained by random projections of triplet descriptors. It allows us to use ultra-fast all-pair similarity search methods for strings with strictly controlled error rates. Exhaustive comparison of 1.2 million known and putative binding sites finished in ∼30 h on a single core to yield 88 million similar binding site pairs. Careful investigation of 3.5 million pairs verified by TM-align revealed several notable analogous sites across distinct protein families or folds. In particular, we succeeded in finding highly plausible functions of several pockets via strong structural analogies. These results indicate that our method is a promising tool for functional annotation of binding sites derived from structural genomics projects.
- Published
- 2011
37. SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints
- Author
-
Yasuo Tabei and Koji Tsuda
- Subjects
Database ,Computer science ,Random projection ,Nearest neighbor search ,Organic Chemistry ,Sorting ,Scale (descriptive set theory) ,computer.software_genre ,Symbol (chemistry) ,Computer Science Applications ,Similarity (network science) ,Structural Biology ,Drug Discovery ,Molecular Medicine ,Pairwise comparison ,computer ,Time complexity - Abstract
Similarity networks of ligands are often reported useful in predicting chemical activities and target proteins. However, the naive method of computing all pairwise similarities of chemical fingerprints takes quadratic time, which is prohibitive for large scale databases with millions of ligands. We propose a fast all pairs similarity search method, called SketchSort, that maps chemical fingerprints to symbol strings with random projections, and finds similar strings by multiple masked sorting. Due to random projection, SketchSort misses a certain fraction of neighbors (i.e., false negatives). Nevertheless, the expected fraction of false negatives is theoretically derived and can be kept under a very small value. Experiments show that SketchSort is much faster than other similarity search methods and enables us to obtain a PubChem-scale similarity network quickly.
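The two ingredients, random projection of fingerprints to short bit sketches and grouping by sorting on parts of the sketch, can be sketched as follows. This simplified version bands each sketch into contiguous blocks and makes a candidate of every pair agreeing on at least one whole block; SketchSort proper enumerates block combinations via masked multi-way sorting and derives an explicit bound on the false negative rate, which this toy version does not.

```python
import numpy as np
from itertools import combinations

def sketches(X, bits=32, seed=0):
    # Signed random projections: one bit per random hyperplane, so nearby
    # rows of X get sketches with small Hamming distance.
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], bits))
    return (X @ R >= 0).astype(np.uint8)

def candidate_pairs(S, blocks=4):
    # Band each sketch into contiguous blocks; any pair agreeing on at
    # least one whole block becomes a candidate pair.
    n, bits = S.shape
    w = bits // blocks
    cands = set()
    for b in range(blocks):
        buckets = {}
        for i in range(n):
            buckets.setdefault(S[i, b * w:(b + 1) * w].tobytes(), []).append(i)
        for grp in buckets.values():
            cands.update(combinations(grp, 2))
    return cands
```

Because candidates are found by grouping equal block keys (a sort or hash) rather than by comparing all pairs, the dominant cost scales with the number of colliding pairs instead of quadratically with the database size.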
- Published
- 2011
38. SlideSort: all pairs similarity search for short reads
- Author
-
Koji Tsuda and Kana Shimizu
- Subjects
Statistics and Probability ,Base Sequence ,Computer science ,Nearest neighbor search ,String (computer science) ,Single-linkage clustering ,Computational Biology ,Sequence assembly ,Sequence Analysis, DNA ,computer.software_genre ,Original Papers ,Biochemistry ,Computer Science Applications ,Computational Mathematics ,Exact algorithm ,Computational Theory and Mathematics ,Edit distance ,Data mining ,Sequence Analysis ,Molecular Biology ,Algorithm ,computer ,Algorithms ,Software - Abstract
Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing. Availability: Executable binary files and C++ libraries are available at http://www.cbrc.jp/~shimizu/slidesort/ for Linux and Windows. Contact: slidesort@m.aist.go.jp; shimizu-kana@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.
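The overall recipe, filter candidate pairs cheaply and verify the survivors with an exact edit-distance computation, can be sketched with a pigeonhole filter. This is not SlideSort's algorithm: the filter below buckets equal-length reads on d+1 same-position segments, which is exact only for substitution differences, whereas SlideSort's chains of common k-mers also cover the position shifts caused by indels.

```python
from itertools import combinations

def edit_distance(a, b):
    # Standard O(len(a)*len(b)) DP with a rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def similar_pairs(reads, d=1):
    # Pigeonhole filter: if two equal-length reads differ by at most d
    # substitutions, they agree exactly on at least one of d+1 segments.
    L = len(reads[0])
    seg = L // (d + 1)
    cands = set()
    for s in range(d + 1):
        lo = s * seg
        hi = L if s == d else lo + seg
        buckets = {}
        for i, r in enumerate(reads):
            buckets.setdefault(r[lo:hi], []).append(i)
        for grp in buckets.values():
            cands.update(combinations(grp, 2))
    # Verification: keep only pairs whose true edit distance is <= d.
    return {(i, j) for i, j in cands if edit_distance(reads[i], reads[j]) <= d}
```

The point of any such filter is to shrink the set of pairs on which the quadratic-time DP must run; SlideSort's contribution is a filter whose candidate set stays small even for remote matches and tens of millions of reads.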
- Published
- 2010
39. Cartesian Kernel : An Efficient Alternative to the Pairwise Kernel
- Author
-
Yoshihiro Yamanishi, Hisashi Kashima, Satoshi Oyama, and Koji Tsuda
- Subjects
Graph kernel ,Computer science ,Kernel principal component analysis ,pairwise kernels ,symbols.namesake ,Matrix (mathematics) ,Kernel (linear algebra) ,kernel methods ,Artificial Intelligence ,String kernel ,Polynomial kernel ,Kronecker delta ,ComputingMethodologies_SYMBOLICANDALGEBRAICMANIPULATION ,Adjacency matrix ,Electrical and Electronic Engineering ,link prediction ,Discrete mathematics ,Kronecker product ,Kernel (set theory) ,Graph ,Kernel method ,Hardware and Architecture ,Kernel embedding of distributions ,Variable kernel density estimation ,Kernel (statistics) ,Radial basis function kernel ,symbols ,Kernel smoother ,Computer Vision and Pattern Recognition ,Kernel Fisher discriminant analysis ,Tree kernel ,Software ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
Pairwise classification has many applications including network prediction, entity resolution, and collaborative filtering. The pairwise kernel has been proposed for those purposes by several research groups independently, and has been used successfully in several fields. In this paper, we propose an efficient alternative which we call a Cartesian kernel. While the existing pairwise kernel (which we refer to as the Kronecker kernel) can be interpreted as the weighted adjacency matrix of the Kronecker product graph of two graphs, the Cartesian kernel can be interpreted as that of the Cartesian graph, which is more sparse than the Kronecker product graph. We discuss the generalization bounds of the two pairwise kernels by using eigenvalue analysis of the kernel matrices. Also, we consider the N-wise extensions of the two pairwise kernels. Experimental results show the Cartesian kernel is much faster than the Kronecker kernel, and at the same time, competitive with the Kronecker kernel in predictive performance.
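Both pairwise kernels have closed forms that are easy to write down. Given base kernel matrices K1 and K2 on the two input sets, the Kronecker kernel on pairs is K1 ⊗ K2, while the Cartesian kernel is K1 ⊗ I + I ⊗ K2, the weighting pattern of the Cartesian product graph. A minimal numpy sketch, dense here only for readability:

```python
import numpy as np

def kronecker_kernel(K1, K2):
    # K((a,b),(c,d)) = K1[a,c] * K2[b,d]
    return np.kron(K1, K2)

def cartesian_kernel(K1, K2):
    # K((a,b),(c,d)) = K1[a,c]*delta(b,d) + delta(a,c)*K2[b,d]
    return (np.kron(K1, np.eye(K2.shape[0]))
            + np.kron(np.eye(K1.shape[0]), K2))
```

An entry of the Cartesian kernel is nonzero only when b = d or a = c, so the matrix is far sparser than the Kronecker kernel, which is the structural reason for the speedup reported in the experiments.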
- Published
- 2010
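Both pairwise kernels have simple closed forms once the base kernel matrices are given. A minimal numpy sketch, assuming symmetric base kernels K1 and K2 and the standard definitions (Kronecker: K1[i,k]·K2[j,l]; Cartesian: K1[i,k]·δ(j,l) + δ(i,k)·K2[j,l]); function names are our own:

```python
import numpy as np

def kronecker_kernel(K1, K2):
    # K((i,j),(k,l)) = K1[i,k] * K2[j,l]: the weighted adjacency
    # matrix of the (dense) Kronecker product graph
    return np.kron(K1, K2)

def cartesian_kernel(K1, K2):
    # K((i,j),(k,l)) = K1[i,k]*d(j,l) + d(i,k)*K2[j,l]: the weighted
    # adjacency matrix of the Cartesian product graph, far sparser
    n1, n2 = K1.shape[0], K2.shape[0]
    return np.kron(K1, np.eye(n2)) + np.kron(np.eye(n1), K2)
```

For dense n1 x n1 and n2 x n2 base kernels, the Kronecker kernel has (n1·n2)² nonzeros, while the Cartesian kernel has only about n1·n2·(n1 + n2), which is the source of the speedup the abstract reports.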
40. DenseZDD: A Compact and Fast Index for Families of Sets
- Author
-
Shin-ichi Minato, Hiroki Arimura, Kunihiko Sadakane, Shuhei Denzumi, Jun Kawahara, and Koji Tsuda
- Subjects
Theoretical computer science ,Current (mathematics) ,lcsh:T55.4-60.8 ,zero-suppressed binary decision diagram ,Computer science ,0102 computer and information sciences ,02 engineering and technology ,set family ,Space (commercial competition) ,Type (model theory) ,computer.software_genre ,01 natural sciences ,lcsh:QA75.5-76.95 ,Theoretical Computer Science ,Succinct data structure ,Set (abstract data type) ,Web information ,succinct data structure ,0202 electrical engineering, electronic engineering, information engineering ,Set operations ,lcsh:Industrial engineering. Management engineering ,Boolean function ,Mathematics ,Numerical Analysis ,Binary decision diagram ,020207 software engineering ,Primitive operation ,Data structure ,Computational Mathematics ,Computational Theory and Mathematics ,Index (publishing) ,010201 computation theory & mathematics ,Data mining ,lcsh:Electronic computers. Computer science ,computer ,Information integration - Abstract
In many real-life problems, we are often faced with manipulating families of sets. Manipulation of large-scale set families is an important fundamental technique for web information retrieval, integration, and mining. For this purpose, a special type of binary decision diagram (BDD), called a zero-suppressed BDD (ZDD), is used. However, current techniques for storing ZDDs require a huge amount of memory, and membership operations are slow. This paper introduces DenseZDD, a compressed index for static ZDDs. Our technique not only indexes set families compactly but also executes membership operations quickly. We also propose a hybrid method combining DenseZDD and ordinary ZDDs to allow for dynamic indices. SEA 2014: 13th International Symposium, Jun 29-Jul 1, 2014, Copenhagen, Denmark
- Published
- 2018
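A membership query on a ZDD walks a single root-to-terminal path, which is exactly the primitive operation DenseZDD makes fast on a compressed representation. A toy pointer-based sketch (not the succinct DenseZDD structure; the node encoding is our own simplification):

```python
ZERO, ONE = 0, 1  # terminal nodes: the empty family / the family {empty set}

def member(node, itemset):
    """Membership query on a toy ZDD.

    A non-terminal node is a tuple (var, lo, hi) representing the family
    lo-family | { S | {var} : S in hi-family }, with variables increasing
    from the root downwards.
    """
    items = sorted(itemset)
    while isinstance(node, tuple):
        var, lo, hi = node
        if items and items[0] == var:
            items.pop(0)
            node = hi      # the query set contains var: follow the hi-edge
        elif items and items[0] < var:
            return False   # the variable order passed the pending item
        else:
            node = lo      # the query set omits var: follow the lo-edge
    return node == ONE and not items

# the family {{1}, {1, 2}} under variable order 1 < 2
FAMILY = (1, ZERO, (2, ONE, ONE))
```

The query touches at most one node per variable, so its cost is bounded by the path length, not the family size; the paper's contribution is storing those nodes succinctly while keeping this traversal fast.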
41. Transfer Learning to Accelerate Interface Structure Searches
- Author
-
Shin Kiyohara, Hiromi Oda, Teruyasu Mizoguchi, and Koji Tsuda
- Subjects
010302 applied physics ,Computer science ,Interface (Java) ,Structure (category theory) ,General Physics and Astronomy ,02 engineering and technology ,021001 nanoscience & nanotechnology ,01 natural sciences ,Calculation methods ,Computational science ,Kriging ,Factor (programming language) ,0103 physical sciences ,Grain boundary ,0210 nano-technology ,Material properties ,Transfer of learning ,computer ,computer.programming_language - Abstract
Interfaces have atomic structures that differ significantly from those in the bulk and play crucial roles in determining material properties. The interface structures that give rise to these properties have been investigated extensively. However, determining even one interface structure requires searching for the stable configuration among many thousands of candidates. Here, a powerful combination of machine learning techniques based on kriging and transfer learning (TL) is proposed as a method for unveiling interface structures. Using the kriging+TL method, thirty-three grain boundaries were systematically determined from 1,650,660 candidates in only 462 calculations, an improvement in efficiency over conventional all-candidate calculation methods by a factor of approximately 3,600.
- Published
- 2017
42. gBoost: a mathematical programming approach to graph classification and regression
- Author
-
Sebastian Nowozin, Tadashi Kadowaki, Hiroto Saigo, Koji Tsuda, and Taku Kudo
- Subjects
Quantitative structure–activity relationship ,Mathematical optimization ,Boosting (machine learning) ,Computer science ,business.industry ,Computation ,Supervised learning ,Machine learning ,computer.software_genre ,Regression ,Artificial Intelligence ,Search algorithm ,AdaBoost ,Artificial intelligence ,Distributed File System ,business ,computer ,Software - Abstract
Graph mining methods enumerate frequently appearing subgraph patterns, which can be used as features for subsequent classification or regression. However, frequent patterns are not necessarily informative for the given learning problem. We propose a mathematical programming boosting method (gBoost) that progressively collects informative patterns. Compared to AdaBoost, gBoost can build the prediction rule with fewer iterations. To apply the boosting method to graph data, a branch-and-bound pattern search algorithm is developed based on the DFS code tree. The constructed search space is reused in later iterations to minimize the computation time. Our method can learn more efficiently than the simpler method based on frequent substructure mining, because the output labels are used as an extra information source for pruning the search space. Furthermore, by engineering the mathematical program, a wide range of machine learning problems can be solved without modifying the pattern search algorithm.
- Published
- 2008
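Setting gBoost's pattern-search internals aside, the boosting loop it accelerates can be illustrated with precomputed binary pattern indicators. A simplified AdaBoost-style sketch (gBoost itself engineers a mathematical program and finds the best pattern on the fly by branch-and-bound over the DFS code tree; here the occurrence matrix X is assumed given):

```python
import numpy as np

def boost_patterns(X, y, n_iters=10):
    """AdaBoost-style rule selection over binary pattern indicators.

    X: (n_samples, n_patterns) 0/1 matrix; X[i, p] = 1 iff pattern p
    occurs in example i.  y: labels in {-1, +1}.
    """
    n, _ = X.shape
    w = np.full(n, 1.0 / n)
    H = np.where(X == 1, 1.0, -1.0)   # one decision stump per pattern
    rules = []                        # (pattern index, sign, weight)
    for _ in range(n_iters):
        gains = (w * y) @ H           # weighted agreement per pattern
        p = int(np.argmax(np.abs(gains)))
        sign = 1.0 if gains[p] >= 0 else -1.0
        err = float(w @ (sign * H[:, p] != y))
        if err >= 0.5:
            break                     # no better-than-random stump left
        if err == 0.0:
            rules.append((p, sign, 1.0))
            break                     # a single perfect rule suffices
        alpha = 0.5 * np.log((1 - err) / err)
        rules.append((p, sign, alpha))
        w *= np.exp(-alpha * y * sign * H[:, p])
        w /= w.sum()
    return rules

def predict(rules, X):
    H = np.where(X == 1, 1.0, -1.0)
    score = sum(alpha * sign * H[:, p] for p, sign, alpha in rules)
    return np.sign(score)
```

In gBoost, each round's argmax over patterns is replaced by a branch-and-bound search whose pruning bound uses the labels, which is why it beats plain frequent-substructure mining followed by learning.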
43. Privacy-Preserving Statistical Analysis by Exact Logistic Regression
- Author
-
Koji Tsuda, Jun Sakuma, David duVerle, Shohei Kawasaki, and Yoshiji Yamada
- Subjects
Exact statistics ,Protocol (science) ,Information privacy ,Computer science ,Confounding ,Sampling (statistics) ,Regression analysis ,Data mining ,Logistic regression ,computer.software_genre ,computer ,Statistical hypothesis testing - Abstract
Logistic regression is the method of choice in most genome-wide association studies (GWAS). Due to the heavy cost of performing iterative parameter updates when training such a model, existing methods have prohibitive communication and computational complexities that make them impractical for real-life usage. We propose a new sampling-based secure protocol to compute exact statistics that requires a constant number of communication rounds and far fewer computations. The publicly available implementation of our protocol (and its many optional optimisations adapted to different security scenarios) can, in a matter of hours, perform statistical testing of over 600 SNP variables across thousands of patients while accounting for potential confounding factors in the clinical data.
- Published
- 2015
44. BDD construction for all solutions SAT and efficient caching mechanism
- Author
-
Koji Tsuda and Takahisa Toda
- Subjects
Computer Science::Performance ,Mechanism (engineering) ,Computer Science::Hardware Architecture ,Unit propagation ,Computer science ,Binary decision diagram ,True quantified Boolean formula ,Computer Science::Logic in Computer Science ,Computer Science::Computational Complexity ,Boolean satisfiability problem ,Algorithm ,Hardware_LOGICDESIGN - Abstract
We improve an existing OBDD-based method for computing all total satisfying assignments of a Boolean formula, where an OBDD is an ordered binary decision diagram that is not necessarily reduced. To do this, we introduce lazy caching and finer caching, which make effective use of unit propagation. We implement our methods on top of a modern SAT solver and show by experiments that lazy caching significantly accelerates the original method and that finer caching in turn reduces the OBDD size.
- Published
- 2015
45. Superset Generation on Decision Diagrams
- Author
-
Shogo Takeuchi, Shin-ichi Minato, Takahisa Toda, and Koji Tsuda
- Subjects
Discrete mathematics ,Set (abstract data type) ,Variable (computer science) ,Monotone boolean function ,Binary decision diagram ,Computer science ,Computer Science::Logic in Computer Science ,Subset and superset ,Computer Science::Artificial Intelligence ,Data structure ,Upper and lower bounds - Abstract
Generating all supersets of the members of a given set family is important because it is closely related to identifying cause-effect relationships. This paper presents an efficient method for superset generation that makes effective use of the compressed data structures BDDs and ZDDs. We analyze the size of a BDD that represents all supersets. As a by-product, we obtain a non-trivial upper bound on the size of a BDD that represents a monotone Boolean function under a fixed variable ordering.
- Published
- 2015
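As a naive baseline for what the decision-diagram method computes symbolically, superset generation can be written as brute-force enumeration over the universe (exponential in the universe size; illustrative only, with our own function name):

```python
from itertools import combinations

def all_supersets(family, universe):
    """Every subset of `universe` that contains at least one member of
    `family`.  The paper performs this computation symbolically on
    BDDs/ZDDs instead of enumerating all 2^|universe| subsets."""
    universe = sorted(universe)
    out = set()
    for r in range(len(universe) + 1):
        for c in combinations(universe, r):
            s = frozenset(c)
            if any(t <= s for t in family):   # s is a superset of some member
                out.add(s)
    return out
```

The result is always an upward-closed (monotone) family, which is why the paper's BDD size analysis connects directly to bounds for monotone Boolean functions.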
46. Modeling splicing sites with pairwise correlations
- Author
-
Kiyoshi Asai, Koji Tsuda, and Masanori Arita
- Subjects
Statistics and Probability ,Source code ,Computer science ,media_common.quotation_subject ,Statistics as Topic ,Machine learning ,computer.software_genre ,Markov model ,Biochemistry ,Sequence Homology, Nucleic Acid ,Humans ,Computer Simulation ,Base Pairing ,Molecular Biology ,media_common ,Models, Statistical ,Models, Genetic ,Genome, Human ,business.industry ,Chromosome Mapping ,Sequence Analysis, DNA ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,RNA splicing ,Pairwise comparison ,RNA Splice Sites ,Artificial intelligence ,business ,Sequence Alignment ,computer ,Algorithms ,Software - Abstract
Motivation: A new method for finding subtle patterns in sequences is introduced. It approximates the multiple correlations among residues with pairwise correlations, with a learning cost of O(m²n), where n is the number of training sequences, each of length m. The method is well suited to modeling splicing sites in human DNA, which are reported to exhibit higher-order dependencies. Results: In computational experiments, the prediction accuracy of our model was shown to surpass that of previously reported Markov models for the prediction of acceptor sites in humans. Availability: The C++ source code is available on request from the authors. Contact: m-arita@aist.go.jp
- Published
- 2002
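A stripped-down version of scoring a site with pairwise positional correlations, assuming aligned fixed-length training windows and a hand-picked set of position pairs (the paper's model selects and combines the pairwise terms more carefully; names here are our own):

```python
import math
from collections import Counter

def train_pairwise(seqs, pairs, alpha=1.0):
    """Fit joint nucleotide log-frequencies for chosen position pairs
    from aligned training windows, with add-alpha smoothing over the
    16 dinucleotide combinations."""
    n = len(seqs)
    model = {}
    for i, j in pairs:
        counts = Counter((s[i], s[j]) for s in seqs)
        model[(i, j)] = {
            (a, b): math.log((counts.get((a, b), 0) + alpha) / (n + 16 * alpha))
            for a in "ACGT" for b in "ACGT"
        }
    return model

def score(model, seq):
    # higher score = closer to the training-site pairwise statistics
    return sum(table[(seq[i], seq[j])] for (i, j), table in model.items())
```

Training touches each of the O(m²) position pairs once per sequence, which matches the O(m²n) learning cost stated in the abstract.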
47. Oblivious Evaluation of Non-deterministic Finite Automata with Application to Privacy-Preserving Virus Genome Detection
- Author
-
Hiroki Harada, Hiroki Arimura, David duVerle, Koji Tsuda, Jun Sakuma, and Hirohito Sasakawa
- Subjects
TheoryofComputation_COMPUTATIONBYABSTRACTDEVICES ,Finite-state machine ,Theoretical computer science ,Computer science ,Powerset construction ,String searching algorithm ,Automaton ,TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES ,Deterministic finite automaton ,DFA minimization ,Nondeterministic finite automaton ,Generalized nondeterministic finite automaton ,Algorithm ,Computer Science::Formal Languages and Automata Theory ,Computer Science::Cryptography and Security - Abstract
Various string matching problems can be solved by means of a deterministic finite automaton (DFA) or a non-deterministic finite automaton (NFA). In non-oblivious cases, DFAs are often preferred for their run-time efficiency despite larger sizes. In oblivious cases, however, the inevitable computation and communication costs associated with the automaton size are more favorable to NFAs. We propose oblivious protocols for NFA evaluation based on homomorphic encryption and demonstrate that our method can be orders of magnitude faster than DFA-based methods, making it applicable to real-life scenarios, such as privacy-preserving detection of viral infection using genomic data.
- Published
- 2014
48. A Fast Method of Statistical Assessment for Combinatorial Hypotheses Based on Frequent Itemset Enumeration
- Author
-
Aika Terada, Shin-ichi Minato, Koji Tsuda, Jun Sese, and Takeaki Uno
- Subjects
Computer science ,Factor (programming language) ,Scientific discovery ,Enumeration ,Data mining ,Arity ,Threshold function ,computer.software_genre ,computer ,Database transaction ,Data mining algorithm ,computer.programming_language - Abstract
In many scientific communities that use experiment databases, one of the crucial problems is how to assess the statistical significance (p-value) of a discovered hypothesis. Combinatorial hypothesis assessment is especially hard because it requires a multiple-testing procedure with a very large p-value correction factor. Recently, Terada et al. proposed a novel p-value correction method, called the "Limitless Arity Multiple-testing Procedure" (LAMP), which is based on frequent itemset enumeration to exclude meaninglessly infrequent itemsets that can never be significant. LAMP yields a much more accurate p-value correction than previous methods, empowering scientific discovery. However, the original LAMP implementation is sometimes too time-consuming for practical databases. We propose a new LAMP algorithm that essentially executes the itemset mining algorithm only once, whereas the previous one executes it many times. Our experimental results show that the proposed method is much (10 to 100 times) faster than the original LAMP. This algorithm enables us to discover patterns with significant p-values in a short time, even for very large-scale databases.
- Published
- 2014
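The insight behind LAMP is that an itemset with support x has a minimal attainable p-value under Fisher's exact test, so infrequent itemsets can be excluded from the multiple-testing correction without risk. A toy sketch of the threshold computation (brute-force itemset enumeration stands in for the frequent itemset mining step; function names are our own):

```python
from itertools import combinations
from math import comb

def min_pvalue(x, n, n1):
    # smallest p-value Fisher's exact test can attain for an itemset
    # with support x: all x occurrences fall into the n1 positives
    x = min(x, n1)
    return comb(n1, x) / comb(n, x)

def lamp_threshold(transactions, n1, alpha=0.05):
    """Find a LAMP-style support threshold lam and correction factor
    m(lam): the smallest lam with m(lam) * min_pvalue(lam) <= alpha,
    where m(lam) counts itemsets of support >= lam.  This toy version
    enumerates every itemset; LAMP avoids that via frequent itemset
    mining."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]
    for lam in range(1, n + 1):
        m = sum(sum(s <= t for t in transactions) >= lam for s in itemsets)
        if m * min_pvalue(lam, n, n1) <= alpha:
            return lam, m
    return n + 1, 0
```

Raising the threshold shrinks both the number of testable itemsets m and each itemset's minimal p-value, so the loop stops at the first support level where the Bonferroni-style product drops below alpha.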
49. An In Silico Model for Interpreting Polypharmacology in Drug–Target Networks
- Author
-
Koji Tsuda, Hiroshi Mamitsuka, and Ichigaku Takigawa
- Subjects
Sequence ,Computer science ,In silico ,Drug target ,Polypharmacology ,Computational biology ,Bioinformatics ,DrugBank - Abstract
Recent analyses of polypharmacology lead to the idea that only small fragments of drugs and targets are key to understanding the interactions that form polypharmacology. This idea motivates us to build an in silico approach for finding significant substructure patterns in drug-target (molecular graph-amino acid sequence) pairs. This article introduces an efficient in silico method for enumerating, from given drug-target pairs, all frequent subgraph-subsequence pairs, which can then be further examined by hypothesis testing for statistical significance. Unique features of the method are its scalability, computational efficiency, and technical soundness in terms of computer science and statistics. The presented method was applied to 11,219 drug-target pairs in DrugBank to obtain significant substructure pairs, which can divide most of the original 11,219 pairs into eight highly exclusive clusters, implying that the obtained substructure pairs are indispensable components for interpreting polypharmacology.
- Published
- 2013
50. Acceleration of stable interface structure searching using a kriging approach
- Author
-
Shin Kiyohara, Koji Tsuda, Hiromi Oda, and Teruyasu Mizoguchi
- Subjects
010302 applied physics ,Computer science ,General Engineering ,General Physics and Astronomy ,02 engineering and technology ,Geostatistics ,021001 nanoscience & nanotechnology ,01 natural sciences ,Computational science ,Brute force ,Kriging ,Computational chemistry ,Lattice (order) ,0103 physical sciences ,0210 nano-technology ,Material properties - Abstract
Crystalline interfaces have a tremendous impact on the properties of materials. Determining the atomic structure of an interface is crucial for a comprehensive understanding of its properties. Despite this importance, extensive calculation is necessary to determine even one interface structure. In this study, we apply a technique called kriging, borrowed from geostatistics, to accelerate the determination of interface structures. The atomic structures of simplified coincidence-site lattice interfaces were determined using the kriging approach. Our approach successfully determined the most stable interface structure with an efficiency almost two orders of magnitude better than the traditional "brute force" approach.
- Published
- 2016
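The kriging loop can be sketched as a Gaussian-process surrogate with a lower-confidence-bound selection rule, assuming each candidate structure is described by a feature vector. This is a generic Bayesian-optimization sketch under those assumptions, not the paper's exact covariance model or acquisition function:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # squared-exponential (Gaussian) covariance between row vectors
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ls ** 2))

def kriging_search(X, energy, n_init=2, n_iters=10, noise=1e-6, seed=0):
    """Sequentially evaluate the candidate minimizing the GP lower
    confidence bound mu - sigma.  `X` holds one descriptor vector per
    candidate structure; `energy(i)` is the expensive stability
    calculation for candidate i."""
    rng = np.random.default_rng(seed)
    tried = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    y = [energy(i) for i in tried]
    for _ in range(n_iters):
        K = rbf(X[tried], X[tried]) + noise * np.eye(len(tried))
        Ks = rbf(X, X[tried])
        mu = Ks @ np.linalg.solve(K, np.array(y))          # posterior mean
        var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
        lcb = mu - np.sqrt(np.clip(var, 0.0, None))        # posterior LCB
        lcb[tried] = np.inf            # never re-evaluate a candidate
        nxt = int(np.argmin(lcb))
        tried.append(nxt)
        y.append(energy(nxt))
    best = tried[int(np.argmin(y))]
    return best, min(y)
```

Each iteration costs one expensive evaluation instead of one per candidate, which is the mechanism behind the roughly hundredfold saving over exhaustive search reported in the abstract.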