Author: "Ross D. King" / Topic: computer science - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Ross D. King"' showing total 82 results


1. Transformational machine learning

Author: Andy M Davis, Ross D. King, Ivan Olier, Joaquin Vanschoren, Larisa N. Soldatova, Oghenejokpeme I. Orhobor, and Tirtharaj Dash
Subjects: QA75, Multitask learning, Computer science, Multi-task learning, Machine learning, computer.software_genre, Drug design, QA76, Robustness (computer science), Representation (mathematics), Multidisciplinary, Artificial neural network, business.industry, Statistics, Random forest, Transfer learning, Support vector machine, Stacking, AI, Physical Sciences, Gradient boosting, Artificial intelligence, Transfer of learning, business, computer
Abstract: Significance: Machine learning (ML) is the branch of artificial intelligence (AI) that develops computational systems that learn from experience. In supervised ML, the ML system generalizes from labelled examples to learn a model that can predict the labels of unseen examples. Examples are generally represented using features that directly describe the examples. For instance, in drug design, ML uses features that describe molecular shape and so on. In cases where there are multiple related ML problems, it is possible to use a different type of feature: predictions made about the examples by ML models learned on other problems. We call this transformational ML. We show that this results in better predictions and improved understanding when applied to scientific problems. Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features.
Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
Published: 2021
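
The transformation described in the abstract of result 1 (intrinsic features replaced by predictions from models trained on other tasks) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy data, the helper names, and the use of a small k-nearest-neighbour regressor as the nonlinear learner. It is not the authors' published code.

```python
# Minimal sketch of transformational ML (TML); data and names are hypothetical.
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Average the labels of the k training examples nearest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def make_model(train_X, train_y, k=3):
    """Return a closure that behaves like a trained model for one task."""
    return lambda x: knn_predict(train_X, train_y, x, k)

def to_extrinsic(models, x):
    """Map intrinsic features x to extrinsic features:
    one prediction per model trained on a different (prior) task."""
    return [model(x) for model in models]

# Intrinsic features (e.g. molecular descriptors) shared by the prior tasks.
X = [[0.0, 0.1], [0.2, 0.9], [0.8, 0.3], [0.9, 0.8], [0.5, 0.5]]
# Labels for three related prior tasks (e.g. activity against other targets).
prior_tasks = [
    [0.1, 0.3, 0.7, 0.9, 0.5],
    [0.2, 0.8, 0.4, 0.9, 0.5],
    [0.9, 0.6, 0.3, 0.1, 0.5],
]
prior_models = [make_model(X, y, k=2) for y in prior_tasks]

# New task: re-represent its examples with extrinsic features, then learn as usual.
new_X = [[0.1, 0.2], [0.85, 0.75], [0.4, 0.6]]
new_y = [0.15, 0.95, 0.5]
new_X_extrinsic = [to_extrinsic(prior_models, x) for x in new_X]
tml_model = make_model(new_X_extrinsic, new_y, k=1)

query = [0.8, 0.8]
prediction = tml_model(to_extrinsic(prior_models, query))
```

Each new-task example ends up with one extrinsic feature per prior model, so the representation grows with the number of related tasks rather than with the number of intrinsic descriptors.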

2. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat

Author: Oghenejokpeme I. Orhobor, Nastasiya F. Grinberg, and Ross D. King
Subjects: 0106 biological sciences, Elastic net regularization, Computer science, Lasso regression, Best linear unbiased prediction, Machine learning, computer.software_genre, 01 natural sciences, Article, 03 medical and health sciences, Lasso (statistics), Artificial Intelligence, Linear regression, GWAS, BLUP, Statistical genetics, 030304 developmental biology, Plant biology, 2. Zero hunger, 0303 health sciences, Support vector machines, business.industry, Gradient boosting machines, Missing data, Random forest, Support vector machine, Ridge regression, Artificial intelligence, Gradient boosting, business, computer, Software, 010606 plant biology & botany
Abstract: In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.
Published: 2019
Full Text: View/download PDF

3. NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Author: Andrey Rzhetsky [0000-0001-6959-7405], Ross D. King, Emily Sheng, Joel Matthew, Weidi Pan, James A. Evans [0000-0001-9838-0707], Fenia Christopoulou, Yu Li [0000-0002-3664-6722], Larisa N. Soldatova [0000-0001-6489-3029], Sahil Garg, José Luis Ambite [0000-0003-0087-080X], Ulf Hermjakob, Kanix Wang [0000-0003-1355-577X], Daniel Marcu, Halima Alachram, Brendan Chambers, Sophia Ananiadou, Annika Marie Schoene, Robert Stevens, Xin Gao [0000-0002-7108-3574], Aram Galstyan, Bohdan B. Khomtchouk [0000-0001-9607-7528], Maolin Li [0000-0002-0828-2001], Tim Beißbarth, and Edgar Wingender
Subjects: Computer science, QH301-705.5, media_common.quotation_subject, Diseases, Ontology (information science), computer.software_genre, General Biochemistry, Genetics and Molecular Biology, Article, Bridging (programming), Annotation, 3102 Bioinformatics and Computational Biology, Knowledge extraction, Named-entity recognition, Drug Discovery, Biology (General), Biomedicine, media_common, 692/699, business.industry, Applied Mathematics, Ambiguity, Computer Science Applications, Networking and Information Technology R&D (NITRD), Modeling and Simulation, Embedding, Artificial intelligence, 631/1647/794, business, computer, Natural language processing, Software, 31 Biological Sciences
Abstract: Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
Published: 2021

4. NERO: A Biomedical Named-entity (Recognition) Ontology with a Large, Annotated Corpus Reveals Meaningful Associations Through Text Embedding

Author: Aram Galstyan, Halima Alachram, Fenia Christopoulou, Edgar Wingender, Tim Beißbarth, Ross D. King, James A. Evans, Sophia Ananiadou, Emily Sheng, Bohdan B. Khomtchouk, Daniel Marcu, Maolin Li, Xin Gao, Brendan Chambers, Robert Stevens, Yu Li, Sahil Garg, Ulf Hermjakob, Larisa N. Soldatova, Kanix Wang, Andrey Rzhetsky, and José Luis Ambite
Subjects: Machine vision, Computer science, business.industry, media_common.quotation_subject, 02 engineering and technology, Ambiguity, Ontology (information science), computer.software_genre, 3. Good health, Bridging (programming), 03 medical and health sciences, Annotation, 0302 clinical medicine, Knowledge extraction, Named-entity recognition, 0202 electrical engineering, electronic engineering, information engineering, Ontology, 020201 artificial intelligence & image processing, 030212 general & internal medicine, Artificial intelligence, business, computer, Natural language processing, media_common
Abstract: Machine reading is essential for unlocking valuable knowledge contained in the millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in machine reading have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in machine reading methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named-entity analysis tool for biomedicine: (a) a new, Named-Entity Recognition Ontology (NERO) developed specifically for describing entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named-entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named-entity recognition automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
Published: 2020
Full Text: View/download PDF

5. Self-supervised learning of object slippage: An LSTM model trained on low-cost tactile sensors

Author: Riza Theresa Batista-Navarro, Ainur Begalinova, Ross D. King, and Barry Lennox
Subjects: Recurrent neural network, business.industry, Computer science, Event (computing), Process (computing), Wearable computer, Robot, Computer vision, Artificial intelligence, business, Object (computer science), Tactile sensor, Slip (vehicle dynamics)
Abstract: This paper presents a combination of machine learning techniques for slip detection in grasping, based on temporal features collected by low-cost tactile sensors. A slippage is an event that is subsequent to prior micro-slippages that have occurred at hand-object contact. The method is based on the application of a sequential classification technique (a variant of recurrent neural networks known as long short-term memory networks or LSTMs), whereby time-series pressure readings from tactile sensors are classified as either slip or non-slip events. We also propose a novel method for autonomous labelling, removing the need for humans in the labelling process. Lastly, this paper proposes a new design for an adaptable wearable tactile sensing device that integrates non-expensive sensors. Our proposed method achieved high accuracy in the classification of slip and non-slip events, obtaining over 95% in offline classification and 89% in online classification using a Sawyer robot.
Published: 2020
Full Text: View/download PDF

6. Using Prior Knowledge to Facilitate Computational Reading of Arabic Calligraphy

Author: Riza Theresa Batista-Navarro, Seetah ALSalamah, and Ross D. King
Subjects: 050101 languages & linguistics, Maximally stable extremal regions, Computer science, Arabic, media_common.quotation_subject, Image processing, 02 engineering and technology, computer.software_genre, 46 Information and Computing Sciences, Reading (process), 0202 electrical engineering, electronic engineering, information engineering, 0501 psychology and cognitive sciences, media_common, business.industry, 05 social sciences, language.human_language, Cultural heritage, Calligraphy, Writing system, Pattern recognition (psychology), language, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing
Abstract: Arabic calligraphy (AC) is central to Arabic cultural heritage and has been used since its introduction, with the first writing of the Holy Quran, up until the present. It is famous for the artistic and complicated ways that letters and words interweave and intertwine to express textual statements, usually quotations from the Quran. These characteristics make it probably the hardest of all human writing systems to read. Here, we introduce the challenge of reading Arabic calligraphy using artificial intelligence (AI), a challenge that combines image processing and understanding of texts. We have collected a corpus of 1000 AC images along with annotated quotations from the Quran, pre-processing the images and identifying individual letters using detection methods based on maximally stable extremal regions (MSERs) and sliding windows (SWs). We then collect the identified letters to form bags of extracted letters (BOLs). These BOLs are then used to search for possible quotations from the corpus. Our results show that MSERs outperform SWs in letter detection. Furthermore, BOL-matching is better than word generation in predicting the correct quotation, with the correct answer found in the list of 10 topmost matches for more than 74% of the 388 test examples.
Published: 2020
Full Text: View/download PDF
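
The bag-of-letters (BOL) search described in the abstract of result 6 can be sketched roughly as follows. The scoring rule (multiset overlap divided by the number of detected letters), the function names, and the Latin-letter placeholder corpus are all assumptions for illustration; the abstract does not specify how the authors score matches.

```python
# Hedged sketch of bag-of-letters (BOL) matching; scoring rule is an assumption.
from collections import Counter

def bag_of_letters(text):
    """Multiset of the non-space characters in a quotation."""
    return Counter(ch for ch in text if not ch.isspace())

def bol_score(detected, quotation):
    """Share of detected letters that also occur in the quotation."""
    overlap = sum((detected & bag_of_letters(quotation)).values())
    return overlap / max(1, sum(detected.values()))

def top_matches(detected_letters, corpus, n=10):
    """Rank corpus quotations by BOL overlap and keep the n best."""
    detected = Counter(detected_letters)
    ranked = sorted(corpus, key=lambda q: bol_score(detected, q), reverse=True)
    return ranked[:n]

# Placeholder corpus (transliterated stand-ins, not the paper's data).
corpus = ["bismillah", "alhamdulillah", "subhanallah"]
# Letters found by a detector, in no particular order, with some missed.
detected_letters = list("bsmilah")
best = top_matches(detected_letters, corpus, n=1)
```

Because a bag ignores letter order, this kind of lookup tolerates the interwoven layout of calligraphy; a practical scorer would probably also penalize quotations much longer than the detected bag.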

7. Multi-task learning with a natural metric for quantitative structure activity relationship learning

Author: Jérémy Besnard, Joaquin Vanschoren, Crina Grosan, Jan N. van Rijn, Ross D. King, Noureddin Sadawi, Ivan Olier, Larisa N. Soldatova [0000-0001-6489-3029], and G. Richard J. Bickerton
Subjects: Quantitative structure–activity relationship, Computer science, media_common.quotation_subject, education, multi-task learning, Multi-task learning, Sequence-based similarity, quantitative structure activity relationship, 02 engineering and technology, Library and Information Sciences, Machine learning, computer.software_genre, sequence-based similarity, lcsh:Chemistry, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Physical and Theoretical Chemistry, QA, Function (engineering), 030304 developmental biology, media_common, 0303 health sciences, lcsh:T58.5-58.64, lcsh:Information technology, business.industry, chEMBL, Computer Graphics and Computer-Aided Design, Computer Science Applications, Random forest, Drug activity, lcsh:QD1-999, Quantitative structure activity relationship, Metric (mathematics), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, random forest, psychological phenomena and processes, Research Article
Abstract: The goal of quantitative structure activity relationship (QSAR) learning is to learn a function that, given the structure of a small molecule (a potential drug), outputs the predicted activity of the compound. We employed multi-task learning (MTL) to exploit commonalities in drug targets and assays. We used datasets containing curated records about the activity of specific compounds on drug targets provided by ChEMBL. In total, 1091 assays were analysed. As a baseline, we considered a single-task learning approach that trains a random forest to predict drug activity for each drug target individually. We then carried out feature-based and instance-based MTL to predict drug activities. We introduced a natural metric of evolutionary distance between drug targets as a measure of task relatedness. Instance-based MTL significantly outperformed both feature-based MTL and the base learner on 741 drug targets out of 1091. Feature-based MTL won on 179 occasions and the base learner performed best on 171 drug targets. We conclude that MTL QSAR is improved by incorporating the evolutionary distance between targets. These results indicate that QSAR learning can be performed effectively, even if little data is available for specific drug targets, by leveraging what is known about similar drug targets. This research was funded by the Engineering and Physical Sciences Research Council (EPSRC) grant EP/K030469/1. NS would like to thank the EU PhenoMeNal project (Horizon 2020, 654241).
Published: 2020
Full Text: View/download PDF
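
As a quick consistency check on the figures quoted in the abstract of result 7, the three win counts can be tallied and converted to shares of the 1091 drug targets. The counts come from the abstract; the dictionary labels are shorthand chosen here.

```python
# Win counts per method, as reported in the abstract of result 7.
wins = {
    "instance-based MTL": 741,
    "feature-based MTL": 179,
    "single-task baseline": 171,
}

total = sum(wins.values())  # should equal the 1091 targets analysed

# Percentage of targets on which each method performed best.
shares = {name: round(100 * count / total, 1) for name, count in wins.items()}
```

The counts sum to 1091, so every target is attributed to exactly one winning method, with instance-based MTL winning on roughly two thirds of them.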

8. Generating Explainable and Effective Data Descriptors Using Relational Learning: Application to Cancer Biology

Author: Ross D. King, Larisa N. Soldatova, Joseph French, and Oghenejokpeme I. Orhobor
Subjects: 0303 health sciences, Generality, Artificial neural network, business.industry, Computer science, Big data, Statistical relational learning, 02 engineering and technology, Machine learning, computer.software_genre, Datalog, 03 medical and health sciences, Inductive logic programming, Simple (abstract algebra), 0202 electrical engineering, electronic engineering, information engineering, Key (cryptography), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, 030304 developmental biology, computer.programming_language
Abstract: The key to success in machine learning is the use of effective data representations. The success of deep neural networks (DNNs) is based on their ability to utilize multiple neural network layers, and big data, to learn how to convert simple input representations into richer internal representations that are effective for learning. However, these internal representations are sub-symbolic and difficult to explain. In many scientific problems explainable models are required, and the input data is semantically complex and unsuitable for DNNs. This is true in the fundamental problem of understanding the mechanism of cancer drugs, which requires complex background knowledge about the functions of genes/proteins, their cells, and the molecular structure of the drugs. This background knowledge cannot be compactly expressed propositionally, and requires at least the expressive power of Datalog. Here we demonstrate the use of relational learning to generate new data descriptors in such semantically complex background knowledge. These new descriptors are effective: adding them to standard propositional learning methods significantly improves prediction accuracy. They are also explainable, and add to our understanding of cancer. Our approach can readily be expanded to include other complex forms of background knowledge, and combines the generality of relational learning with the efficiency of standard propositional learning.
Published: 2020
Full Text: View/download PDF
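
The idea in the abstract of result 8, turning relational background knowledge into descriptors that a standard propositional learner can use, can be illustrated with a toy sketch. The facts, relation names, and the single rule below are hypothetical; the paper's actual background knowledge and rule-learning procedure are not shown in this listing.

```python
# Toy sketch: a Datalog-style rule evaluated over facts to yield a binary descriptor.
# All facts and relation names are hypothetical illustrations.
facts = {
    ("targets", "drug_a", "gene_1"),
    ("targets", "drug_b", "gene_2"),
    ("has_function", "gene_1", "kinase"),
    ("has_function", "gene_2", "transporter"),
    ("expressed_in", "gene_1", "cell_line_x"),
}

def targets_gene_with_function(drug, function):
    """Rule: descriptor(D, F) :- targets(D, G), has_function(G, F)."""
    return any(
        ("has_function", gene, function) in facts
        for relation, subject, gene in facts
        if relation == "targets" and subject == drug
    )

def describe(drug, functions):
    """One 0/1 column per function, usable by any propositional learner."""
    return [int(targets_gene_with_function(drug, f)) for f in functions]

functions = ["kinase", "transporter"]
row_a = describe("drug_a", functions)
row_b = describe("drug_b", functions)
```

Each descriptor remains readable as a rule, which is what makes the resulting features explainable while still fitting into an ordinary feature table.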

9. Federated Ensemble Regression Using Classification

Author: Oghenejokpeme I. Orhobor, Ross D. King, and Larisa N. Soldatova
Subjects: 0301 basic medicine, business.industry, Computer science, 02 engineering and technology, Machine learning, computer.software_genre, Ensemble learning, Regression, Task (project management), 03 medical and health sciences, Improved performance, ComputingMethodologies_PATTERNRECOGNITION, 030104 developmental biology, Multiple Models, 0202 electrical engineering, electronic engineering, information engineering, Predictive power, Learning set, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Regression problems
Abstract: Ensemble learning has been shown to significantly improve predictive accuracy in a variety of machine learning problems. For a given predictive task, the goal of ensemble learning is to improve predictive accuracy by combining the predictive power of multiple models. In this paper, we present an ensemble learning algorithm for regression problems which leverages the distribution of the samples in a learning set to achieve improved performance. We apply the proposed algorithm to a problem in precision medicine where the goal is to predict drug perturbation effects on genes in cancer cell lines. The proposed approach significantly outperforms the base case.
Published: 2020
Full Text: View/download PDF

10. Closed-loop cycles of experiment design, execution, and learning accelerate systems biology model development in yeast

Author: Ross D. King, Jacek Grzebyta, Martin Carpenter, Jan Ramon, Henry Soldano, Céline Rouveirol, Guillaume Santini, Anthony Coutant, Katherine Roper, Larisa N. Soldatova, Daniel Trejo-Banos, Dominique Bouthinon, and Mohamed Elati
Affiliations: Laboratoire d'Informatique de Paris-Nord (LIPN), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS), University of Manchester [Manchester], Génomique métabolique (UMR 8030), Genoscope - Centre national de séquençage [Evry] (GENOSCOPE), Université Paris-Saclay-Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Brunel University London [Uxbridge], Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle (MNHN)-École Pratique des Hautes Études (EPHE), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université des Antilles (UA), Programme d'Épigénomique, Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS), Machine Learning in Information Networks (MAGNET), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), University of London [London], The Alan Turing Institute, National Institute of Advanced Industrial Science and Technology (AIST), Université Sorbonne Paris Cité (USPC)-Institut Galilée-Université Paris 13 (UP13)-Centre National de la Recherche Scientifique (CNRS), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Centre National de la Recherche Scientifique (CNRS)-Université d'Évry-Val-d'Essonne (UEVE), and Muséum national d'Histoire naturelle (MNHN)-École pratique des hautes études (EPHE)
Subjects: 0301 basic medicine, Computer science, Systems biology, Distributed computing, 0206 medical engineering, Cloud computing, Saccharomyces cerevisiae, 02 engineering and technology, Reuse, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 03 medical and health sciences, Software, Gene Expression Regulation, Fungal, Semantic Web, [SDV.MP.MYC]Life Sciences [q-bio]/Microbiology and Parasitology/Mycology, Multidisciplinary, business.industry, Systems Biology, Computational Biology, diauxic shift, Robotics, Biological Sciences, artificial intelligence, Biophysics and Computational Biology, ComputingMethodologies_PATTERNRECOGNITION, 030104 developmental biology, Laboratory robotics, machine learning, Physical Sciences, Laboratory automation, Robot, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], business, 020602 bioinformatics
Abstract: Significance: Systems biology involves the development of large computational models of biological systems. The radical improvement of systems biology models will necessarily involve the automation of model improvement cycles. We present here a general approach to automating systems biology model improvement. Humans are eukaryotic organisms, and the yeast Saccharomyces cerevisiae is widely used in biology as a “model” for eukaryotic cells. The yeast diauxic shift is the most studied cellular transformation. We combined multiple software tools with integrated laboratory robotics to execute three semiautomated cycles of diauxic shift model improvement. All the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for execution. The resulting improved model is relevant to understanding cancer, the immune system, and aging. One of the most challenging tasks in modern science is the development of systems biology models: existing models are often very complex but generally have low predictive performance. The construction of high-fidelity models will require hundreds/thousands of cycles of model improvement, yet few current systems biology research studies complete even a single cycle. We combined multiple software tools with integrated laboratory robotics to execute three cycles of model improvement of the prototypical eukaryotic cellular transformation, the yeast (Saccharomyces cerevisiae) diauxic shift. In the first cycle, a model outperforming the best previous diauxic shift model was developed using bioinformatic and systems biology tools. In the second cycle, the model was further improved using automatically planned experiments. In the third cycle, hypothesis-led experiments improved the model to a greater extent than achieved using high-throughput experiments.
All of the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for automatic execution, and the results stored on the semantic web for reuse. The final model adds a substantial amount of knowledge about the yeast diauxic shift: 92 genes (+45%), and 1,048 interactions (+147%). This knowledge is also relevant to understanding cancer, the immune system, and aging. We conclude that systems biology software tools can be combined and integrated with laboratory robots in closed-loop cycles.
Published: 2019
Full Text: View/download PDF

11. Deep Learning Does Not Generalize Well to Recognizing Cats and Dogs in Chinese Paintings

Author: Ross D. King and Qianqian Gu
Subjects: Computer science, business.industry, Deep learning, Region proposal, Cognitive neuroscience of visual object recognition, Pattern recognition, Computational aesthetics, 010501 environmental sciences, 01 natural sciences, Object detection, 03 medical and health sciences, 0302 clinical medicine, Semantic memory, Artificial intelligence, business, 030217 neurology & neurosurgery, 0105 earth and related environmental sciences
Abstract: Although Deep Learning (DL) image analysis has made recent rapid advances, it still has limitations that indicate that its approach differs significantly from human vision, e.g. the requirement for large training sets, and vulnerability to adversarial attacks. Here we show that DL also differs in failing to generalize well to Traditional Chinese Paintings (TCPs). We developed a new DL object detection method, A-RPN (Assembled Region Proposal Network), which concatenates low-level visual information and high-level semantic knowledge to reduce coarseness in region-based object detection. A-RPN significantly outperforms YOLO2 and Faster R-CNN on natural images (P < 0.02). We applied YOLO2, Faster R-CNN and A-RPN to TCPs and observed drops in mAP of 12.9%, 13.2% and 13.4%, respectively, compared to natural images. There was little or no difference in recognizing humans, but a large drop in mAP for cats and dogs (27% and 31%), and a very large drop for horses (35.9%). The abstract nature of TCPs may be responsible for DL's poor performance.
Published: 2019
Full Text: View/download PDF

12. Towards the Machine Reading of Arabic Calligraphy: A Letters Dataset and Corresponding Corpus of Text

Author: Seetah Al Salamah and Ross D. King
Subjects: Text corpus, Holy quran, Arabic, Computer science, business.industry, media_common.quotation_subject, Variety (linguistics), computer.software_genre, language.human_language, Calligraphy, Reading (process), Pattern recognition (psychology), language, Artificial intelligence, business, Machine reading, computer, Natural language processing, media_common
Abstract: Arabic calligraphy is one of the great art forms of the world. It displays Arabic phrases, commonly taken from the Holy Quran, in beautiful two-dimensional form. The use of two dimensions, and the interweaving of letters and words makes reading a far greater challenge for Artificial Intelligence (AI) than reading standard printed or hand-written Arabic. To approach this challenge, we have constructed a dataset of Arabic calligraphic letters, along with a corresponding corpus of phrases and quotes. The letters dataset contains a total of 3,467 images for 32 various categories of Arabic calligraphic-type letters. The associated text corpus contains 544 unique quoted phrases. These data were collected from various open sources on the web, and include examples from several Arabic calligraphic styles. We have also undertaken both an explorative statistical analysis of this data, and initial machine learning investigations. These analyses suggest that combining knowledge of a limited variety of Arabic calligraphy texts, with a successful machine will be sufficient for the machine reading of forms of Arabic calligraphy.
Published: 2018
Full Text: View/download PDF

13. Large-Scale Assessment of Deep Relational Machines

Author: Ashwin Srinivasan, Lovekesh Vig, Oghenejokpeme I. Orhobor, Ross D. King, and Tirtharaj Dash
Subjects: business.industry, Computer science, 02 engineering and technology, Space (commercial competition), Machine learning, computer.software_genre, Regression, Domain (software engineering), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Domain knowledge, 020201 artificial intelligence & image processing, Artificial intelligence, Scale (map), business, computer
Abstract: Deep Relational Machines (or DRMs) present a simple way for incorporating complex domain knowledge into deep networks. In a DRM this knowledge is introduced through relational features: in the original formulation of [1], the features are selected by an ILP engine using domain knowledge encoded as logic programs. More recently, in [2], DRMs appear to achieve good performance without the need of feature-selection by an ILP engine (the features are simply drawn randomly from a space of relevant features). The reports so far on DRMs though have been deficient on three counts: (a) They have been tested on very small amounts of data (7 datasets, not all independent, altogether with few 1000s of instances); (b) The background knowledge involved has been modest, involving few 10s of predicates; and (c) Performance assessment has been only on classification tasks. In this paper we rectify each of these shortcomings by testing on datasets from the biochemical domain involving 100s of 1000s of instances; industrial-strength background predicates involving multiple hierarchies of complex definitions; and on classification and regression tasks. Our results provide substantially reliable evidence of the predictive capabilities of DRMs; along with a significant improvement in predictive performance with the incorporation of domain knowledge. We propose the new datasets and results as updated benchmarks for comparative studies in neural-symbolic modelling.
Published: 2018
Full Text: View/download PDF

14. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery

Author: Crina Grosan, Ivan Olier, Joaquin Vanschoren, G. Richard J. Bickerton, Noureddin Sadawi, Ross D. King, Larisa N. Soldatova, Data Mining, Grosan, Crina [0000-0003-1049-2136], and Apollo - University of Cambridge Repository
Subjects: FOS: Computer and information sciences, I.2, 0301 basic medicine, RM, Quantitative structure–activity relationship, ResearchInstitutes_Networks_Beacons/MICRA, Meta learning (computer science), Computer Science - Artificial Intelligence, Computer science, 02 engineering and technology, Q1, Machine learning, computer.software_genre, Article, Machine Learning (cs.LG), Set (abstract data type), 03 medical and health sciences, Algorithm selection, Resource (project management), Meta-learning, Artificial Intelligence, Manchester Institute of Biotechnology, 0202 electrical engineering, electronic engineering, information engineering, QA, Representation (mathematics), business.industry, Drug discovery, QSAR, ResearchInstitutes_Networks_Beacons/manchester_institute_of_biotechnology, R1, Regression, 3. Good health, Random forest, Computer Science - Learning, Artificial Intelligence (cs.AI), 030104 developmental biology, Manchester Institute for Collaborative Research on Ageing, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Software, Applicability domain
Abstract: We investigate the learning of quantitative structure activity relationships (QSARs) as a case-study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning there is no agreed single best way of learning QSARs, and therefore the problem area is well-suited to meta-learning. We first carried out the most comprehensive ever comparison of machine learning methods for QSAR learning: 18 regression methods, 6 molecular representations, applied to more than 2,700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13%, on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and as this investigation is one of the most extensive comparisons of base and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning., Comment: 33 pages and 15 figures. Manuscript accepted for publication in Machine Learning Journal. This is the author's pre-print version
Published: 2017
Full Text: View/download PDF

15. Qualitative System Identification from Imperfect Data

Author: Ross D. King, Ashwin Srinivasan, and George M. Coghill
Subjects: FOS: Computer and information sciences, Structure (mathematical logic), Mathematical model, Computer Science - Artificial Intelligence, business.industry, Computer science, System identification, Complex system, Machine learning, computer.software_genre, Set (abstract data type), Identification (information), Artificial Intelligence (cs.AI), Empirical research, Inductive logic programming, Artificial Intelligence, Artificial intelligence, business, computer
Abstract: Experience in the physical sciences suggests that the only realistic means of understanding complex systems is through the use of mathematical models. Typically, this has come to mean the identification of quantitative models expressed as differential equations. Quantitative modelling works best when the structure of the model (i.e., the form of the equations) is known; and the primary concern is one of estimating the values of the parameters in the model. For complex biological systems, the model-structure is rarely known and the modeler has to deal with both model-identification and parameter-estimation. In this paper we are concerned with providing automated assistance to the first of these problems. Specifically, we examine the identification by machine of the structural relationships between experimentally observed variables. These relationships will be expressed in the form of qualitative abstractions of a quantitative model. Such qualitative models may not only provide clues to the precise quantitative model, but also assist in understanding the essence of that model. Our position in this paper is that background knowledge incorporating system modelling principles can be used to constrain effectively the set of good qualitative models. Utilising the model-identification framework provided by Inductive Logic Programming (ILP) we present empirical support for this position using a series of increasingly complex artificial datasets. The results are obtained with qualitative and quantitative data subject to varying amounts of noise and different degrees of sparsity. The results also point to the presence of a set of qualitative states, which we term kernel subsets, that may be necessary for a qualitative model-learner to learn correct models. We demonstrate scalability of the method to biological system modelling by identification of the glycolysis metabolic pathway from data. ©2008 AI Access Foundation. All rights reserved.
Published: 2008
Full Text: View/download PDF

16. Representation of molecular structure using quantum topology with inductive logic programming in structure–activity relationships

Author: Einar Ryeng, Ross D. King, Bård Buttingsrud, and Bjørn K. Alsberg
Subjects: Structure (mathematical logic), Interpretation (logic), Computer science, Atoms in molecules, Quantum topology, Type (model theory), Computer Science Applications, Set (abstract data type), Structure-Activity Relationship, Inductive logic programming, Drug Discovery, Mutagenesis, Site-Directed, Quantum Theory, Physical and Theoretical Chemistry, Representation (mathematics), Algorithm
Abstract: The requirement of aligning each individual molecule in a data set severely limits the type of molecules which can be analysed with traditional structure activity relationship (SAR) methods. A method which solves this problem by using relations between objects is inductive logic programming (ILP). Another advantage of this methodology is its ability to include background knowledge as 1st-order logic. However, previous molecular ILP representations have not been effective in describing the electronic structure of molecules. We present a more unified and comprehensive representation based on Richard Bader's quantum topological atoms in molecules (AIM) theory where critical points in the electron density are connected through a network. AIM theory provides a wealth of chemical information about individual atoms and their bond connections enabling a more flexible and chemically relevant representation. To obtain even more relevant rules with higher coverage, we apply manual postprocessing and interpretation of ILP rules. We have tested the usefulness of the new representation in SAR modelling on classifying compounds of low/high mutagenicity and on a set of factor Xa inhibitors of high and low affinity.
Published: 2006
Full Text: View/download PDF

17. Predicting the Geographical Origin of Music

Author: Fang Zhou, Q. Claire, and Ross D. King
Subjects: Great-circle distance, Structure (mathematical logic), Training set, Computer science, business.industry, Feature extraction, Machine learning, computer.software_genre, Cross-validation, Random forest, Artificial intelligence, Data mining, business, Representation (mathematics), computer
Abstract: Traditional research into the arts has almost always been based around the subjective judgment of human critics. The use of data mining tools to understand art has great promise as it is objective and operational. We investigate the distribution of music from around the world: geographical ethnomusicology. We cast the problem as training a machine learning program to predict the geographical origin of pieces of music. This is a technically interesting problem as it has features of both classification and regression, and because of the spherical geometry of the surface of the Earth. Because of these characteristics of the representation of geographical positions, most standard classification/regression methods cannot be directly used. Two applicable methods are K-Nearest Neighbors and Random forest regression, which are robust to the non-standard structure of data. We also investigated improving performance through use of bagging. We collected 1,142 pieces of music from 73 countries/areas, and described them using 2 different sets of standard audio descriptors using MARSYAS. 10-fold cross validation was used in all experiments. The experimental results indicate that Random forest regression produces significantly better results than KNN, and the use of bagging improves the performance of KNN. The best performing algorithm achieved a mean great circle distance error of 3,113 km.
Published: 2014
Full Text: View/download PDF
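
The abstract above reports its best result as a mean great circle distance error. As a rough illustration of that metric (not the authors' code), the sketch below uses the standard haversine formula. The Earth radius constant and the function names `great_circle_km` and `mean_great_circle_error` are assumptions introduced here for illustration.

```python
import math

# Assumed constant: mean Earth radius in km (not taken from the paper).
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_great_circle_error(predicted, actual):
    """Average haversine distance over paired (lat, lon) tuples."""
    distances = [great_circle_km(p[0], p[1], t[0], t[1])
                 for p, t in zip(predicted, actual)]
    return sum(distances) / len(distances)
```

Because the error is measured along the sphere rather than in raw latitude/longitude units, a prediction just across the 180° meridian from the true origin counts as a small error, which an ordinary squared-error loss on coordinates would not capture.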

18. The Use of Weighted Graphs for Large-Scale Genome Analysis

Author: Ross D. King, Fang Zhou, Hannu Toivonen, Department of Computer Science, Discovery Research Group/Prof. Hannu Toivonen, and Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan)
Subjects: Evolutionary Genetics, lcsh:Medicine, Genetic Networks, Genome, Data sequences, Genome, Archaeal, Genome Databases, lcsh:Science, Genome Evolution, Mathematical Computing, Genetics, graphs, Evolutionary Theory, Multidisciplinary, Phylogenetic tree, Genomics, bioinformatics, Biological Evolution, Graph, Enzymes, Isoenzymes, Sequence Analysis, Glycolysis, Algorithms, Metabolic Networks and Pathways, Research Article, Network analysis, education, Sequence Databases, Computational biology, Biology, Models, Biological, Genome Analysis Tools, Evolutionary Modeling, Evolutionary Biology, Bacteria, lcsh:R, Computational Biology, Genomic Evolution, Biological evolution, data mining, Data structure, 113 Computer and information sciences, Computing Methods, Archaea, Computer Science, Mutation, lcsh:Q, Genome, Bacterial
Abstract: There is an acute need for better tools to extract knowledge from the growing flood of sequence data. For example, thousands of complete genomes have been sequenced, and their metabolic networks inferred. Such data should enable a better understanding of evolution. However, most existing network analysis methods are based on pair-wise comparisons, and these do not scale to thousands of genomes. Here we propose the use of weighted graphs as a data structure to enable large-scale phylogenetic analysis of networks. We have developed three types of weighted graph for enzymes: taxonomic (these summarize phylogenetic importance), isoenzymatic (these summarize enzymatic variety/redundancy), and sequence-similarity (these summarize sequence conservation); and we applied these types of weighted graph to survey prokaryotic metabolism. To demonstrate the utility of this approach we have compared and contrasted the large-scale evolution of metabolism in Archaea and Eubacteria. Our results provide evidence for limits to the contingency of evolution.
Published: 2014
Full Text: View/download PDF

19. EXACT2: the semantics of biomedical protocols

Author: Nigel J. Saunders, Piyali Basu, Brian B. Rudkin, Ross D. King, Daniel Nadis, Wolfgang Marwan, Véronique Baumle, Larisa N. Soldatova, and Emma Haddi
Subjects: Computer science, Semantics (computer science), Ontology (information science), computer.software_genre, Biochemistry, Text mining, Software, Structural Biology, Data Mining, Biomedical protocols, Molecular Biology, Reference model, Biomedicine, Language, Reproducibility, Electronic Data Processing, business.industry, Applied Mathematics, Research, Reproducibility of Results, Biological Ontologies, Construct (python library), Replication (computing), Computer Science Applications, Semantics, EXACT2, Ontology, Data mining, business, Software engineering, computer, Natural language
Abstract: © 2014 Soldatova et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. This article has been made available through the Brunel Open Access Publishing Fund. Background: The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for the better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that ensures that all information required for the replication of experimental protocols is essential to achieve reproducibility. Methods: We have developed the ontology EXACT2 (EXperimental ACTions) that is designed to capture the full semantics of biomedical protocols required for their reproducibility. To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information we utilized text-mining tools to translate the protocols into a machine amenable format. We have verified the utility of EXACT2 through the successful processing of previously ‘unseen’ (not used for the construction of EXACT2) protocols. Results: The paper reports on a fundamentally new version EXACT2 that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case. 
In this EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine amenable format is proposed. Conclusions: The EXACT2 ontology is sufficient to record, in a machine processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions, and can be used by various computer applications. It can serve as a reference model for the translation of biomedical protocols in natural language into a semantically-defined format. This work has been partially funded by the Brunel University BRIEF award and a grant from Occams Resources.
Published: 2014

20. [Untitled]

Author: Luc Dehaspe, Ashwin Srinivasan, and Ross D. King
Subjects: Structure (mathematical logic), PROGOL, Artificial neural network, business.industry, Computer science, Probabilistic logic, InformationSystems_DATABASEMANAGEMENT, computer.software_genre, Machine learning, Field (computer science), Computer Science Applications, Inductive logic programming, Drug Discovery, Data mining, Artificial intelligence, Physical and Theoretical Chemistry, business, computer, Chemical database, Test data, computer.programming_language
Abstract: Data mining techniques are becoming increasingly important in chemistry as databases become too large to examine manually. Data mining methods from the field of Inductive Logic Programming (ILP) have potential advantages for structural chemical data. In this paper we present Warmr, the first ILP data mining algorithm to be applied to chemoinformatic data. We illustrate the value of Warmr by applying it to a well studied database of chemical compounds tested for carcinogenicity in rodents. Data mining was used to find all frequent substructures in the database, and knowledge of these frequent substructures is shown to add value to the database. One use of the frequent substructures was to convert them into probabilistic prediction rules relating compound description to carcinogenesis. These rules were found to be accurate on test data, and to give some insight into the relationship between structure and activity in carcinogenesis. The substructures were also used to prove that there existed no accurate rule, based purely on atom-bond substructure with less than seven conditions, that could predict carcinogenicity. This result puts a lower bound on the complexity of the relationship between chemical structure and carcinogenicity. Only by using a data mining algorithm, and by doing a complete search, is it possible to prove such a result. Finally the frequent substructures were shown to add value by increasing the accuracy of statistical and machine learning programs that were trained to predict chemical carcinogenicity. We conclude that Warmr, and ILP data mining methods generally, are an important new tool for analysing chemical databases.
Published: 2001
Full Text: View/download PDF

21. Cascaded multiple classifiers for secondary structure prediction

Author: Mohammed Ouali and Ross D. King
Subjects: Artificial neural network, Computer science, business.industry, Proteins, Pattern recognition, Bioinformatics, Protein secondary structure prediction, Biochemistry, Test protein, Protein Structure, Secondary, Homologous Sequences, Prediction methods, Resampling, Neural Networks, Computer, Artificial intelligence, business, Molecular Biology, Protein secondary structure, Classifier (UML), Research Article
Abstract: We describe a new classifier for protein secondary structure prediction that is formed by cascading together different types of classifiers using neural networks and linear discrimination. The new classifier achieves an accuracy of 76.7% (assessed by a rigorous full Jack-knife procedure) on a new nonredundant dataset of 496 nonhomologous sequences (obtained from G.J. Barton and J.A. Cuff). This database was especially designed to train and test protein secondary structure prediction methods, and it uses a more stringent definition of homologous sequence than in previous studies. We show that it is possible to design classifiers that can highly discriminate the three classes (H, E, C) with an accuracy of up to 78% for beta-strands, using only a local window and resampling techniques. This indicates that the importance of long-range interactions for the prediction of beta-strands has been probably previously overestimated.
Published: 2000
Full Text: View/download PDF

22. [Untitled]

Author: Ashwin Srinivasan and Ross D. King
Subjects: Quantitative structure–activity relationship, Computer Networks and Communications, business.industry, Process (engineering), Computer science, Machine learning, computer.software_genre, Field (computer science), Computer Science Applications, Task (project management), Quantitative analysis (finance), Inductive logic programming, Feature (machine learning), Artificial intelligence, Data mining, business, Construct (philosophy), computer, Information Systems
Abstract: Recently, computer programs developed within the field of Inductive Logic Programming (ILP) have received some attention for their ability to construct restricted first-order logic solutions using problem-specific background knowledge. Prominent applications of such programs have been concerned with determining “structure-activity” relationships in the areas of molecular biology and chemistry. Typically the task here is to predict the “activity” of a compound (for example, toxicity), from its chemical structure. A summary of the research in the area is: (a) ILP programs have largely been restricted to qualitative predictions of activity (“high”, “low” etc.); (b) When appropriate attributes are available, ILP programs have equivalent predictivity to standard quantitative analysis techniques like linear regression. However ILP programs usually perform better when such attributes are unavailable; and (c) By using structural information as background knowledge, an ILP program can provide comprehensible explanations for biological activity. This paper examines the use of ILP programs as a method of “discovering” new attributes. These attributes could then be used by methods like linear regression, thus allowing for quantitative predictions while retaining the ability to use structural information as background knowledge. Using structure-activity tasks as a test-bed, the utility of ILP programs in constructing new features was evaluated by examining the prediction of biological activity using linear regression, with and without the aid of ILP learnt logical attributes. In three out of the five data sets examined the addition of ILP attributes produced statistically better results. In addition six important structural features that have escaped the attention of the expert chemists were discovered. 
The method used here to construct new attributes is not specific to the problem of predicting biological activity, and the results obtained suggest a wider role for ILP programs in aiding the process of scientific discovery.
Published: 1999
Full Text: View/download PDF

23. Topic Models with Relational Features for Drug Design

Author: Ashwin Srinivasan, Tanveer A. Faruquie, and Ross D. King
Subjects: Topic model, Computer science, business.industry, Bayesian probability, Probabilistic logic, Machine learning, computer.software_genre, Latent Dirichlet allocation, symbols.namesake, Inductive logic programming, Concept learning, Feature (machine learning), symbols, Artificial intelligence, business, Representation (mathematics), computer
Abstract: To date, ILP models in drug design have largely focussed on models in first-order logic that relate two- or three-dimensional molecular structure of a potential drug (a ligand) to its activity (for example, inhibition of some protein). In modelling terms: (a) the models have largely been logic-based (although there have been some attempts at probabilistic models); (b) the models have been mostly of a discriminatory nature (they have been mainly used for classification tasks); and (c) data for concepts to be learned are usually provided explicitly: “hidden” or latent concept learning is rare. Each of these aspects imposes certain limitations on the use of such models for drug design. Here, we propose the use of “topic models”—correctly, hierarchical Bayesian models—as a general and powerful modelling technique for drug design. Specifically, we use the feature-construction capabilities of a general-purpose ILP system to incorporate complex relational information into topic models for drug-like molecules. Our main interest in this paper is to describe computational tools to assist the discovery of drugs for malaria. To this end, we describe the construction of topic models using the GlaxoSmithKline Tres Cantos Antimalarial TCAMS dataset. This consists of about 13,000 inhibitors of the 3D7 strain of P. falciparum in human erythrocytes, obtained by screening of approximately 2 million compounds. We investigate the discrimination of molecules into groups (for example, “more active” and “less active”). For this task, we present evidence that suggests that when it is important to maximise the detection of molecules with high activity (“hits”), topic-based classifiers may be better than those that operate directly on the feature-space representation of the molecules. Besides the applicability for modelling anti-malarials, an obvious utility of topic-modelling as a technique of reducing the dimensionality of ILP-constructed feature spaces is also apparent.
Published: 2013
Full Text: View/download PDF

24. Representation of probabilistic scientific knowledge

Author: Kurt De Grave, Andrey Rzhetsky, Ross D. King, and Larisa N. Soldatova
Subjects: Knowledge representation and reasoning, Relation (database), Computer Networks and Communications, Computer science, Active learning (machine learning), Inference, Health Informatics, Probabilistic reasoning, Ontology (information science), Machine learning, computer.software_genre, 03 medical and health sciences, 0302 clinical medicine, Selection (linguistics), ontology, Representation (mathematics), probabilistic reasoning, 030304 developmental biology, 0303 health sciences, Information retrieval, business.industry, Ontology, knowledge representation, Probabilistic logic, Computer Science Applications, Proceedings, Knowledge representation, 030220 oncology & carcinogenesis, Artificial intelligence, business, computer, Information Systems
Abstract: This article is available through the Brunel Open Access Publishing Fund. Copyright © 2013 Soldatova et al; licensee BioMed Central Ltd. The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work the probabilities of the research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning, and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, conclusions, etc. to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure activity relation learning), where a strategy for the selection of compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at https://github.com/larisa-soldatova/HELO. This work was partially supported by grant BB/F008228/1 from the UK Biotechnology & Biological Sciences Research Council, from the European Commission under the FP7 Collaborative Programme, UNICELLSYS, KU Leuven GOA/08/008 and ERC Starting Grant 240186.
Published: 2013

25. Relating chemical activity to structure: An examination of ILP successes

Author: Michael J.E. Sternberg, Ross D. King, and Ashwin Srinivasan
Subjects: Structure (mathematical logic), Matching (statistics), Computer science, business.industry, Computer Networks and Communications, Decision tree, Construct (python library), Machine learning, computer.software_genre, Theoretical Computer Science, Inductive logic programming, Hardware and Architecture, Encoding (memory), Feature (machine learning), Artificial intelligence, business, Representation (mathematics), computer, Software
Abstract: Problems concerned with learning the relationships between molecular structure and activity have been important test-beds for Inductive Logic Programming (ILP) systems. In this paper we examine these applications and empirically evaluate the extent to which a first-order representation was required. We compared ILP theories with those constructed using standard linear regression and a decision-tree learner on a series of progressively more difficult problems. When a propositional encoding is feasible for the feature-based algorithms, we show that such algorithms are capable of matching the predictive accuracies of an ILP theory. However, as the complexity of the compounds considered increased, propositional encodings become intractable. In such cases, our results show that ILP programs can still continue to construct accurate, understandable theories. Based on this evidence, we propose future work to realise fully the potential of ILP in structure-activity problems.
Published: 1995
Full Text: View/download PDF

26. STATLOG: COMPARISON OF CLASSIFICATION ALGORITHMS ON LARGE REAL-WORLD PROBLEMS

Author: Ross D. King, Cao Feng, and Alistair Sutherland
Subjects: Artificial neural network, business.industry, Computer science, Bayesian network, Pattern recognition, Linear discriminant analysis, Machine learning, computer.software_genre, Data set, Naive Bayes classifier, Statistical classification, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, Projection pursuit, Artificial intelligence, business, Categorical variable, computer
Abstract: This paper describes work in the StatLog project comparing classification algorithms on large real-world problems. The algorithms compared were from symbolic learning (CART, C4.5, NewID, AC2, ITrule, Cal5, CN2), statistics (Naive Bayes, k-nearest neighbor, kernel density, linear discriminant, quadratic discriminant, logistic regression, projection pursuit, Bayesian networks), and neural networks (backpropagation, radial basis functions). Twelve datasets were used: five from image analysis, three from medicine, and two each from engineering and finance. We found that which algorithm performed best depended critically on the data set investigated. We therefore developed a set of data set descriptors to help decide which algorithms are suited to particular data sets. For example, data sets with extreme distributions (skew > 1 and kurtosis > 7) and with many binary/categorical attributes (>38%) tend to favor symbolic learning algorithms. We suggest how classification algorithms can be extended in a number of d...
Published: 1995
Full Text: View/download PDF
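
The data set descriptors mentioned in the abstract above (skew, kurtosis, share of binary/categorical attributes) can be sketched as follows. This is only an illustrative reading of the quoted thresholds, not the StatLog implementation: the abstract does not say how the moments are defined or aggregated across attributes, so the sketch assumes population moments, the plain (non-excess) fourth standardized moment, and a mean of absolute skew over numeric attributes. The function names are hypothetical.

```python
def _central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0.0 for a constant column."""
    var = _central_moment(values, 2)
    return 0.0 if var == 0 else _central_moment(values, 3) / var ** 1.5


def kurtosis(values):
    """Fourth standardized moment (assumed non-excess, so 3 for a normal)."""
    var = _central_moment(values, 2)
    return 0.0 if var == 0 else _central_moment(values, 4) / var ** 2


def favors_symbolic(numeric_columns, n_categorical, n_attributes):
    """Rule of thumb using the thresholds quoted in the abstract.

    Assumption: skew and kurtosis are averaged over numeric attributes.
    """
    mean_skew = sum(abs(skewness(c)) for c in numeric_columns) / len(numeric_columns)
    mean_kurt = sum(kurtosis(c) for c in numeric_columns) / len(numeric_columns)
    categorical_share = n_categorical / n_attributes
    return mean_skew > 1 and mean_kurt > 7 and categorical_share > 0.38
```

A descriptor check of this kind would be computed once per data set and used as a hint for choosing an algorithm family, rather than as a replacement for an empirical comparison.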

27. COMPARISON OF ARTIFICIAL INTELLIGENCE METHODS FOR MODELING PHARMACEUTICAL QSARS

Author: Ross D. King, Jonathan D. Hirst, and Michael J.E. Sternberg
Subjects: Quantitative structure–activity relationship, Artificial neural network, biology, business.industry, Computer science, Rank (computer programming), Statistical difference, Machine learning, computer.software_genre, Artificial Intelligence, Linear regression, Dihydrofolate reductase, biology.protein, Artificial intelligence, business, computer
Abstract: A common step in pharmaceutical development is the formation of a quantitative structure-activity relationship (QSAR) to model an exploratory series of compounds. A QSAR generalizes how the structure (shape) of a compound relates to its biological activity. A comparative study was carried out of six artificial intelligence and traditional algorithms for modeling QSARs: GOLEM, CART, and MS from symbolic machine learning; back-propagation from neural networks; and linear regression and nearest-neighbor from traditional statistics. Two test case problems were studied: the inhibition of Escherichia coli dihydrofolate reductase (DHFR) by pyrimidines, and the inhibition of rat/mouse tumor DHFR by triazines. It was found that there was no significant statistical difference between the methods in terms of their ability to rank unseen compounds by activity. However, symbolic machine learning methods, in particular relational ones, were found to generate rules that provided insight into the stereochemistry of com...
Published: 1995
Full Text: View/download PDF

28. Applications of inductive logic programming

Author: Ross D. King and Ivan Bratko
Subjects: Theoretical computer science, Inductive logic programming, Computer science, General Earth and Planetary Sciences, Polygon mesh, Inductive programming, General Environmental Science
Abstract: Some applications of Inductive Logic Programming (ILP) are presented. Those applications are chosen that specifically benefit from relational descriptions generated by ILP programs, and from ILP's ability to accommodate background knowledge. Applications included are: drug design, predicting the secondary structure of proteins, and design of finite-element meshes. Some other applications are briefly described. The practical advantages and disadvantages of ILP learning are discussed.
Published: 1994
Full Text: View/download PDF

29. On the formalization and reuse of scientific research

Author: Ross D. King, Stephen G. Oliver, Larisa N. Soldatova, Chuan Lu, and Maria Liakata
Subjects: Biomedical Research, Computer science, Formalism (philosophy), Logic, Systems biology, Biomedical Engineering, Biophysics, Bioengineering, Saccharomyces cerevisiae, Reuse, Ontology (information science), computer.software_genre, Biochemistry, Biomaterials, Fungal Proteins, semantic web, Gene Expression Regulation, Fungal, Computer Simulation, ontology, Semantic Web, Research Articles, Fungal protein, logic, business.industry, Information Dissemination, Ontology, Systems Biology, Genomics, Models, Theoretical, Rotation formalisms in three dimensions, Metadata, Data mining, Software engineering, business, computer, Semantic web, Biotechnology
Abstract: The reuse of scientific knowledge obtained from one investigation in another investigation is basic to the advance of science. Scientific investigations should therefore be recorded in ways that promote the reuse of the knowledge they generate. The use of logical formalisms to describe scientific knowledge has potential advantages in facilitating such reuse. Here, we propose a formal framework for using logical formalisms to promote reuse. We demonstrate the utility of this framework by using it in a worked example from biology: demonstrating cycles of investigation formalization [F] and reuse [R] to generate new knowledge. We first used logic to formally describe a Robot scientist investigation into yeast (Saccharomyces cerevisiae) functional genomics [f1]. With Robot scientists, unlike human scientists, the production of comprehensive metadata about their investigations is a natural by-product of the way they work. We then demonstrated how this formalism enabled the reuse of the research in investigating yeast phenotypes [r1 = R(f1)]. This investigation found that the removal of non-essential enzymes generally resulted in enhanced growth. The phenotype investigation was then formally described using the same logical formalism as the functional genomics investigation [f2 = F(r1)]. We then demonstrated how this formalism enabled the reuse of the phenotype investigation to investigate yeast systems-biology modelling [r2 = R(f2)]. This investigation found that yeast flux-balance analysis models fail to predict the observed changes in growth. Finally, the systems biology investigation was formalized for reuse in future investigations [f3 = F(r2)]. These cycles of reuse are a model for the general reuse of scientific knowledge.
Published: 2011
Full Text: View/download PDF

30. Representation, Simulation, and Hypothesis Generation in Graph and Logical Models of Biological Networks

Author: Ken Whelan, Ross D. King, and Oliver Ray
Subjects: Power graph analysis, ComputingMethodologies_PATTERNRECOGNITION, Theoretical computer science, Computer science, And-inverter graph, Metabolic modeling, Graph (abstract data type), Graph theory, Metabolism, Graph, Biological network, Coarse structure
Abstract: This chapter presents a discussion of metabolic modeling from graph theory and logical modeling perspectives. These perspectives are closely related and focus on the coarse structure of metabolism, rather than the finer details of system behavior. The models have been used as background knowledge for hypothesis generation by Robot Scientists using yeast as a model eukaryote, where experimentation and machine learning are used to identify additional knowledge to improve the metabolic model. The logical modeling concept is being adapted to cell signaling and transduction biological networks.
Published: 2011
Full Text: View/download PDF

31. New approaches to QSAR: Neural networks and machine learning

Author: Michael J.E. Sternberg, Ross D. King, and Jonathan D. Hirst
Subjects: Pharmacology, Quantitative structure–activity relationship, Artificial neural network, biology, Statistical assumption, business.industry, Computer science, Organic Chemistry, Nonparametric statistics, Machine learning, computer.software_genre, Field (computer science), Nonlinear system, Drug Discovery, Linear regression, Dihydrofolate reductase, biology.protein, Artificial intelligence, business, computer
Abstract: Neural networks and machine learning are two methods that are increasingly being used to model QSARs. They make few statistical assumptions and are nonlinear and nonparametric. We describe back-propagation from the field of neural networks, and GOLEM from machine learning, and illustrate their learning mechanisms using a simple expository problem. Back-propagation and GOLEM are then compared with multiple linear regression (using the parameters and their squares) on two real drug design problems: the inhibition of Escherichia coli dihydrofolate reductase (DHFR) by pyrimidines and the inhibition of rat/mouse tumour DHFR by triazines.
Published: 1993
Full Text: View/download PDF

32. Further developments towards a genome-scale metabolic model of yeast

Author: Ross D. King, Marie Brown, Evangelos Simeonidis, Paul R. Fisher, Robert Stevens, Paul D. Dobson, Pedro Mendes, Douglas B. Kell, Stephen G. Oliver, Daniel Jameson, Pınar Pir, Kieran Smallbone, Olusegun Oshota, Chuan-Zhen Lu, Neil Swainston, Duncan Hull, Natalie J. Stanford, Warwick B. Dunn, Karin Lanthaler, King, Ross [0000-0001-7208-4387], Oliver, Stephen [0000-0001-6330-7526], and Apollo - University of Cambridge Repository
Subjects: reconstruction, Computer science, Systems biology, Saccharomyces cerevisiae, Computational biology, Models, Biological, information, 03 medical and health sciences, Metabolomics, markup, promiscuity, Structural Biology, Modelling and Simulation, Manchester Institute of Biotechnology, SBML, genes, lcsh:QH301-705.5, Molecular Biology, science, 030304 developmental biology, Genetics, 0303 health sciences, biology, Applied Mathematics, 030302 biochemistry & molecular biology, Lipid metabolism, systems biology, Molecular Sequence Annotation, biology.organism_classification, ResearchInstitutes_Networks_Beacons/manchester_institute_of_biotechnology, Lipid Metabolism, Yeast, Computer Science Applications, Flux balance analysis, sbml, lcsh:Biology (General), Modeling and Simulation, network, saccharomyces-cerevisiae, Genome, Fungal, Software, Research Article
Abstract: Background: To date, several genome-scale network reconstructions have been used to describe the metabolism of the yeast Saccharomyces cerevisiae, each differing in scope and content. The recent community-driven reconstruction, while rigorously evidenced and well annotated, under-represented metabolite transport, lipid metabolism and other pathways, and was not amenable to constraint-based analyses because of lack of pathway connectivity. Results: We have expanded the yeast network reconstruction to incorporate many new reactions from the literature and represented these in a well-annotated and standards-compliant manner. The new reconstruction comprises 1102 unique metabolic reactions involving 924 unique metabolites - significantly larger in scope than any previous reconstruction. The representation of lipid metabolism in particular has improved, with 234 out of 268 enzymes linked to lipid metabolism now present in at least one reaction. Connectivity is emphatically improved, with more than 90% of metabolites now reachable from the growth medium constituents. The present updates allow constraint-based analyses to be performed; viability predictions of single knockouts are comparable to results from in vivo experiments and to those of previous reconstructions. Conclusions: We report the development of the most complete reconstruction of yeast metabolism to date that is based upon reliable literature evidence and richly annotated according to MIRIAM standards. The reconstruction is available in the Systems Biology Markup Language (SBML) and via a publicly accessible database http://www.comp-sys-bio.org/yeastnet/.
Published: 2010
Full Text: View/download PDF

33. Logic-Based Steady-State Analysis and Revision of Metabolic Networks with Inhibition

Author: Ken Whelan, Oliver Ray, and Ross D. King
Subjects: Steady state, Theoretical computer science, Metabolic Model, Computer science, Semantics (computer science), Commonsense reasoning, Non-monotonic logic, Set (psychology), Logic programming
Abstract: This paper presents a qualitative logic-based method for the steady-state analysis and revision of metabolic networks with inhibition. The approach is able to automatically revise an initial metabolic model -- through the addition and removal of whole reactions or individual substrates, products and inhibitors -- in order to ensure the existence of a steady-state behaviour consistent with a set of experimental observations. We show how this can be done in a nonmonotonic logic programming setting and discuss the challenges that arise when metabolic cycles or mutual inhibitions occur in the underlying network.
Published: 2010
Full Text: View/download PDF

34. Inductive Queries for a Drug Designing Robot Scientist

Author: Jem J. Rowland, Ross D. King, Amanda Clare, Siegfried Nijssen, Andrew Sparkes, Jan Ramon, Amanda C. Schierz, Dzeroski, Saso, Goethals, Bart, Panov, Pance, and UCL - SST/ICTM/INGI - Pôle en ingénierie informatique
Subjects: Discovery science, ge, business.industry, Computer science, aintel, Machine learning, computer.software_genre, Drug design, Business process discovery, Knowledge extraction, Inductive logic programming, chem, Key (cryptography), Robot, Artificial intelligence, Representation (mathematics), business, Adaptation (computer science), Data mining, computer
Abstract: It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments. Part of: Inductive Databases and Constraint-Based Data Mining, pages 425-451. Status: published.
Published: 2010
Full Text: View/download PDF

35. Automatic Revision of Metabolic Networks through Logical Analysis of Experimental Data

Author: Ken Whelan, Ross D. King, and Oliver Ray
Subjects: Reasoning system, Logical analysis, Computer science, business.industry, Metabolic network, Robot, Experimental data, Artificial intelligence, Non-monotonic logic, Inductive reasoning, business
Abstract: This paper presents a nonmonotonic ILP approach for the automatic revision of metabolic networks through the logical analysis of experimental data. The method extends previous work in two respects: by suggesting revisions that involve both the addition and removal of information; and by suggesting revisions that involve combinations of gene functions, enzyme inhibitions, and metabolic reactions. Our proposal is based on a new declarative model of metabolism expressed in a nonmonotonic logic programming formalism. With respect to this model, a mixture of abductive and inductive inference is used to compute a set of minimal revisions needed to make a given network consistent with some observed data. In this way, we describe how a reasoning system called XHAIL was able to correctly revise a state-of-the-art metabolic pathway in the light of real-world experimental data acquired by an autonomous laboratory platform called the Robot Scientist.
Published: 2010
Full Text: View/download PDF

36. IPSA—Inductive Protein Structure Analysis

Author: Ross D. King and Steffen Schulze-Kremer
Subjects: Models, Molecular, Structure (mathematical logic), Theoretical computer science, Databases, Factual, Protein Conformation, Computer science, Proteins, Bioengineering, Biochemistry, Structure-Activity Relationship, Alpha (programming language), Range (mathematics), Protein structure, Simple (abstract algebra), Sequence Homology, Nucleic Acid, Consensus clustering, Cluster Analysis, Cluster analysis, Sequence Alignment, Molecular Biology, Protein secondary structure, Software, Biotechnology
Abstract: The Inductive Protein Structure Analysis (IPSA) project presents a new method for investigating protein structure. IPSA includes the creation of a new database which was designed specifically for the analysis of protein structure by statistics and machine learning. The Protein Representation Language (PRL) database includes explicit and symbolic representations of geometrical, topological and chemophysical information about secondary structures and the relationships between secondary structures. The IPSA methodology consists of: the use of PRL information to produce a new database of examples of secondary structures which associate together (examples of possible super-secondary structures); then the use of a variety of clustering techniques to produce a consensus clustering of these examples (super-secondary structures); these super-secondary structures are finally examined to uncover any biological features of significance. We have applied this method to find simple super-secondary structures consisting of pairs of alpha-helices. We found four well-defined super-secondary structures, one formed exclusively by long range interactions, and another in association with an additional element of secondary structure (alpha t alpha-motif). Examinations were carried out using homologous pairs and conformational fits which confirm our clustering.
Published: 1992
Full Text: View/download PDF

37. Protein secondary structure prediction using logic-based machine learning

Author: Michael J.E. Sternberg, Ross D. King, and Stephen Muggleton
Subjects: Databases, Factual, Computer science, Molecular Sequence Data, Bioengineering, Machine learning, computer.software_genre, Biochemistry, Protein Structure, Secondary, Domain (software engineering), Artificial Intelligence, Computer Simulation, Amino Acid Sequence, Mathematical Computing, Molecular Biology, Protein secondary structure, Structure (mathematical logic), Computer program, Artificial neural network, business.industry, Reproducibility of Results, Small set, Alpha (programming language), Models, Chemical, Inductive logic programming, Artificial intelligence, business, computer, Biotechnology
Abstract: Many attempts have been made to solve the problem of predicting protein secondary structure from the primary sequence but the best performance results are still disappointing. In this paper, the use of a machine learning algorithm which allows relational descriptions is shown to lead to improved performance. The Inductive Logic Programming computer program, Golem, was applied to learning secondary structure prediction rules for alpha/alpha domain type proteins. The input to the program consisted of 12 non-homologous proteins (1612 residues) of known structure, together with a background knowledge describing the chemical and physical properties of the residues. Golem learned a small set of rules that predict which residues are part of the alpha-helices--based on their positional relationships and chemical and physical properties. The rules were tested on four independent non-homologous proteins (416 residues) giving an accuracy of 81% (+/- 2%). This is an improvement, on identical data, over the previously reported result of 73% by King and Sternberg (1990, J. Mol. Biol., 216, 441-457) using the machine learning program PROMIS, and of 72% using the standard Garnier-Osguthorpe-Robson method. The best previously reported result in the literature for the alpha/alpha domain type is 76%, achieved using a neural net approach. Machine learning also has the advantage over neural network and statistical methods in producing more understandable results.
Published: 1992
Full Text: View/download PDF

38. A Nonmonotonic Logical Approach for Modelling and Revising Metabolic Networks

Author: Ross D. King, Ken Whelan, and Oliver Ray
Subjects: Formalism (philosophy of mathematics), Theoretical computer science, Computer science, business.industry, Logical approach, Artificial intelligence, Non-monotonic logic, business, Logic programming
Abstract: This paper describes a new logic-based approach for representing and reasoning about metabolic networks. First it shows how biological pathways can be elegantly represented in a logic programming formalism able to model full chemical reactions with substrates and products in different cell compartments, and which are catalysed by iso-enzymes or enzyme-complexes that are subject to inhibitory feedbacks. Then it shows how a nonmonotonic reasoning system called XHAIL can be used as a practical method for learning and revising such metabolic networks from observational data. Preliminary results are described in which the approach is validated on a state-of-the-art model of Aromatic Amino Acid biosynthesis.
Published: 2009
Full Text: View/download PDF

39. Drugs and Drug-Like Compounds: Discriminating Approved Pharmaceuticals from Screening-Library Compounds

Author: Amanda C. Schierz and Ross D. King
Subjects: Drug, Absorption (pharmacology), Drug likeness, Computer science, media_common.quotation_subject, Lipinski's rule of five, Computational biology, Pharmacology, media_common
Abstract: Compounds in drug screening-libraries should resemble pharmaceuticals. To operationally test this, we analysed the compounds in terms of known drug-like filters and developed a novel machine learning method to discriminate approved pharmaceuticals from "drug-like" compounds. This method uses both structural features and molecular properties for discrimination. The method has an estimated accuracy of 91% in discriminating between the Maybridge HitFinder library and approved pharmaceuticals, and 99% between the NATDiverse collection (from Analyticon Discovery) and approved pharmaceuticals. These results show that Lipinski's Rule of 5 for oral absorption is not sufficient to describe "drug-likeness" and be the main basis of screening-library design.
Published: 2009
Full Text: View/download PDF

40. The EXACT description of biomedical protocols

Author: Amanda Clare, Larisa N. Soldatova, Wayne Aubrey, and Ross D. King
Subjects: Statistics and Probability, Databases, Factual, Computer science, Databases and Ontologies, Clinical Biochemistry, Information Storage and Retrieval, Documentation, Ontology (information science), computer.software_genre, Biochemistry, Experiment protocols, Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto, Code (cryptography), Laboratory science, Experiment ACTions ontology, Molecular Biology, Internet, Information retrieval, business.industry, Research, Original Papers, Computer Science Applications, Biological laboratory protocols, Computational Mathematics, Computational Theory and Mathematics, Ontology, Database Management Systems, The Internet, Data mining, business, computer
Abstract: © 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Motivation: Many published manuscripts contain experiment protocols which are poorly described or deficient in information. This means that the published results are very hard or impossible to repeat. This problem is being made worse by the increasing complexity of high-throughput/automated methods. There is therefore a growing need to represent experiment protocols in an efficient and unambiguous way. Results: We have developed the Experiment ACTions (EXACT) ontology as the basis of a method of representing biological laboratory protocols. We provide example protocols that have been formalized using EXACT, and demonstrate the advantages and opportunities created by using this formalization. We argue that the use of EXACT will result in the publication of protocols with increased clarity and usefulness to the scientific community. Availability: The ontology, examples and code can be downloaded from http://www.aber.ac.uk/compsci/Research/bio/dss/EXACT/ RC UK, RAEng/EPSRC, and BBSRC.
Published: 2008
Full Text: View/download PDF

41. Using a logical model to predict the growth of yeast

Author: Ken Whelan and Ross D. King
Subjects: Correctness, Theoretical computer science, Saccharomyces cerevisiae Proteins, Knowledge representation and reasoning, Computer science, Metabolic network, Saccharomyces cerevisiae, lcsh:Computer applications to medicine. Medical informatics, Biochemistry, Models, Biological, Structural Biology, Logical data model, Computer Simulation, Molecular Biology, lcsh:QH301-705.5, Cell Proliferation, Predicate logic, Applied Mathematics, System identification, Flux balance analysis, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, Logistic Models, lcsh:Biology (General), Logical form, lcsh:R858-859.7, Signal transduction, Research Article, Signal Transduction
Abstract: Background: A logical model of the known metabolic processes in S. cerevisiae was constructed from iFF708, an existing Flux Balance Analysis (FBA) model, and augmented with information from the KEGG online pathway database. The use of predicate logic as the knowledge representation for modelling enables an explicit representation of the structure of the metabolic network, and enables logical inference techniques to be used for model identification/improvement. Results: Compared to the FBA model, the logical model has information on an additional 263 putative genes and 247 additional reactions. The correctness of this model was evaluated by comparison with iND750 (an updated FBA model closely related to iFF708) by evaluating the performance of both models on predicting empirical minimal medium growth data/essential gene listings. Conclusion: ROC analysis and other statistical studies revealed that use of the simpler logical form and larger coverage results in no significant degradation of performance compared to iND750.
Published: 2008

42. Active Learning for Regression Based on Query by Committee

Author: Jem J. Rowland, Robert Burbidge, and Ross D. King
Subjects: Training set, Active learning (machine learning), Computer science, business.industry, Variance (accounting), Overfitting, Machine learning, computer.software_genre, Class (biology), Regression, Minification, Artificial intelligence, business, computer, Selection (genetic algorithm)
Abstract: We investigate a committee-based approach for active learning of real-valued functions. This is a variance-only strategy for selection of informative training data. As such it is shown to suffer when the model class is misspecified since the learner's bias is high. Conversely, the strategy outperforms passive selection when the model class is very expressive since active minimization of the variance avoids overfitting.
Published: 2007
Full Text: View/download PDF
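
Note: the variance-only selection strategy described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the committee is built here by fitting straight lines to bootstrap resamples of the labelled data, and the names `fit_line`, `train_committee`, `committee_variance`, and `select_query` are hypothetical. The only idea taken from the abstract is that the next training point is the one on which the committee's predictions vary most.

```python
import random
from statistics import pvariance


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # All x values identical: fall back to a constant model.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return slope, mean_y - slope * mean_x


def train_committee(labelled, size, rng):
    """Fit one model per bootstrap resample (an assumed way to get diversity)."""
    return [
        fit_line([rng.choice(labelled) for _ in labelled])
        for _ in range(size)
    ]


def committee_variance(committee, x):
    """Disagreement of the committee at x: variance of its predictions."""
    return pvariance([slope * x + intercept for slope, intercept in committee])


def select_query(committee, pool):
    """Pick the unlabelled input where the committee disagrees most."""
    return max(pool, key=lambda x: committee_variance(committee, x))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    labelled = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]
    pool = [0.5, 1.5, 4.0, 10.0]
    committee = train_committee(labelled, size=5, rng=rng)
    print("next query:", select_query(committee, pool))
```

Because the criterion looks only at variance, it says nothing about bias, which is consistent with the abstract's remark that the strategy suffers when the model class is misspecified.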

43. Logic and the Automatic Acquisition of Scientific Knowledge: An Application to Functional Genomics

Author: Luc Dehaspe, Amanda Clare, Ross D. King, and Andreas Karwath
Subjects: Discovery science, Sociology of scientific knowledge, Scientific technique, Inductive logic programming, Point (typography), Computer science, Process (engineering), Data analysis, Data science, Task (project management)
Abstract: This paper is a manifesto aimed at computer scientists interested in developing and applying scientific discovery methods. It argues that: science is experiencing an unprecedented "explosion" in the amount of available data; traditional data analysis methods cannot deal with this increased quantity of data; there is an urgent need to automate the process of refining scientific data into scientific knowledge; inductive logic programming (ILP) is a data analysis framework well suited for this task; and exciting new scientific discoveries can be achieved using ILP scientific discovery methods. We describe an example of using ILP to analyse a large and complex bioinformatic database that has produced unexpected and interesting scientific results in functional genomics. We then point a possible way forward to integrating machine learning with scientific databases to form intelligent databases.
Published: 2007
Full Text: View/download PDF

44. Learning Qualitative Models of Physical and Biological Systems

Author: Simon Garrett, Ross D. King, George M. Coghill, and Ashwin Srinivasan
Subjects: business.industry, Computer science, Scientific discovery, computer.software_genre, Machine learning, Outcome (game theory), Identification (information), Inductive logic programming, Kernel (statistics), Key (cryptography), Artificial intelligence, Data mining, Noise (video), business, Set (psychology), computer
Abstract: We present a qualitative model-learning system, Qoph, developed for application to scientific discovery problems. Qoph learns the structural relations between a set of observed variables. It has been shown capable of learning models with intermediate (unmeasured) variables, and intermediate relations, under different levels of noise, and from qualitative or quantitative data. A biological application of Qoph is explored. An additional significant outcome of this work is the discovery and identification of kernel subsets of key states that must be present for model-learning to succeed.
Published: 2007
Full Text: View/download PDF

45. Overhauling the PDB

Author: Larisa N. Soldatova, Amanda C Schierz, and Ross D. King
Subjects: Database, Computer science, Biomedical Engineering, Protein Data Bank (RCSB PDB), Bioengineering, computer.file_format, Models, Theoretical, Protein Data Bank, computer.software_genre, Applied Microbiology and Biotechnology, Upgrade, Molecular Medicine, Databases, Protein, computer, Biotechnology
Abstract: The Brookhaven Protein Data Bank was once a pioneering database, but its organization of structural data is now outdated and in need of an upgrade.
Published: 2007

46. Author Privacy, Data Fabrication, and Knowledge Discovery in Databases

Author: Ross D. King, May O. Lwin, and Jerome D. Williams
Subjects: Information privacy, Database, business.industry, Privacy software, Data stream mining, Computer science, computer.software_genre, Computer security, Data science, Knowledge extraction, Software deployment, Market data, The Internet, business, computer, Anonymity
Abstract: The problem of data fabrication, due to heightened consumer concerns about privacy, is on the rise. The unique characteristic of the Internet, anonymity, is a probable contributor to the intention of users to fabricate information. We propose a technological solution to this problem based on the deployment of knowledge discovery in database (KDD) systems to learn discrimination functions that discriminate between correct and fabricated data. These discrimination functions can then be used to form filters that remove falsified data from marketing data. That such discrimination functions are possible is due to the characteristic form falsified data takes. The greatest hurdle to implementing this approach is the availability of data labeled as "falsified" and "correct." However, the proposed technological solution offers potential to marketers and businesses alike.
Published: 2006
Full Text: View/download PDF

47. An ontology of scientific experiments

Author: Ross D. King and Larisa N. Soldatova
Subjects: Information retrieval, Computer science, Ontology-based data integration, Process ontology, Research, Biomedical Engineering, Biophysics, Suggested Upper Merged Ontology, Bioengineering, Ontology (information science), Ontology language, Classification, Biochemistry, Biomaterials, Open Biomedical Ontologies, Ontology components, Upper ontology, Biotechnology, Research Article
Abstract: The formal description of experiments for efficient analysis, annotation and sharing of results is a fundamental part of the practice of science. Ontologies are required to achieve this objective. A few subject-specific ontologies of experiments currently exist. However, despite the unity of scientific experimentation, no general ontology of experiments exists. We propose the ontology EXPO to meet this need. EXPO links the SUMO (the Suggested Upper Merged Ontology) with subject-specific ontologies of experiments by formalizing the generic concepts of experimental design, methodology and results representation. EXPO is expressed in the W3C standard ontology language OWL-DL. We demonstrate the utility of EXPO and its ability to describe different experimental domains, by applying it to two experiments: one in high-energy physics and the other in phylogenetics. The use of EXPO made the goals and structure of these experiments more explicit, revealed ambiguities, and highlighted an unexpected similarity. We conclude that EXPO is of general value in describing experiments and a step towards the formalization of science.
Published: 2006

48. An ontology for a Robot Scientist

Author: Amanda Clare, Larisa N. Soldatova, Ross D. King, and Andrew Sparkes
Subjects: Statistics and Probability, Saccharomyces cerevisiae Proteins, Databases, Factual, computer.internet_protocol, Computer science, media_common.quotation_subject, Science, Cell Culture Techniques, Information Storage and Retrieval, Documentation, Saccharomyces cerevisiae, Ontology (information science), Biochemistry, World Wide Web, Annotation, Text mining, Human–computer interaction, Artificial Intelligence, Representation (mathematics), Function (engineering), Molecular Biology, media_common, Natural Language Processing, business.industry, Research, Experimental data, Robotics, Computer Science Applications, Metadata, Computational Mathematics, Computational Theory and Mathematics, Vocabulary, Controlled, Research Design, Ontology, Robot, Database Management Systems, business, computer, XML
Abstract: Motivation: A Robot Scientist is a physically implemented robotic system that can automatically carry out cycles of scientific experimentation. We are commissioning a new Robot Scientist designed to investigate gene function in S. cerevisiae. This Robot Scientist will be capable of initiating >1,000 experiments, and making >200,000 observations a day. Robot Scientists provide a unique test bed for the development of methodologies for the curation and annotation of scientific experiments: because the experiments are conceived and executed automatically by computer, it is possible to completely capture and digitally curate all aspects of the scientific process. This new ability brings with it significant technical challenges. To meet these we apply an ontology driven approach to the representation of all the Robot Scientist’s data and metadata. Results: We demonstrate the utility of developing an ontology for our new Robot Scientist. This ontology is based on a general ontology of experiments. The ontology aids the curation and annotating of the experimental data and metadata, and the equipment metadata, and supports the design of database systems to hold the data and metadata. Availability: EXPO in XML and OWL formats is at: . All materials about the Robot Scientist project are available at: . Contact: lss@aber.ac.uk
Published: 2006

49. The Robot Scientist Project

Author: Jem J. Rowland, Kenneth E. Whelan, Michael Young, Ross D. King, and Amanda Clare
Subjects: biology, Computer science, business.industry, media_common.quotation_subject, Saccharomyces cerevisiae, Robotics, Metabolism, biology.organism_classification, Machine learning, computer.software_genre, Automation, Yeast, Prolog, Inductive logic programming, Knowledge extraction, Robot, Artificial intelligence, Function (engineering), business, Gene, computer, media_common, computer.programming_language
Abstract: We are interested in the automation of science for both philosophical and technological reasons. To this end we have built the first automated system that is capable of automatically: originating hypotheses to explain data, devising experiments to test these hypotheses, physically running these experiments using a laboratory robot, interpreting the results, and then repeating the cycle. We call such automated systems “Robot Scientists”. We applied our first Robot Scientist to predicting the function of genes in a well-understood part of the metabolism of the yeast S. cerevisiae. For background knowledge, we built a logical model of metabolism in Prolog. The experiments consisted of growing mutant yeast strains with known genes knocked out on specified growth media. The results of these experiments allowed the Robot Scientist to test hypotheses it had abductively inferred from the logical model. In empirical tests, the Robot Scientist experiment selection methodology outperformed both randomly selecting experiments, and a greedy strategy of always choosing the experiment of lowest cost; it was also as good as the best humans tested at the task. To extend this proof of principle result to the discovery of novel knowledge we require new hardware that is fully automated, a model of all of the known metabolism of yeast, and an efficient way of inferring probable hypotheses. We have made progress in all of these areas, and we are currently building a new Robot Scientist that we hope will be able to automatically discover new biological knowledge.
Published: 2005
Full Text: View/download PDF

50. Intelligent software for laboratory automation

Author: Ken Whelan and Ross D. King
Subjects: Computer science, Economics, Statistics as Topic, Bioengineering, Saccharomyces cerevisiae, Models, Biological, Task (project management), Management Information Systems, Amino Acids, Aromatic, Automation, Software, Artificial Intelligence, Biological sciences, business.industry, Research, Systems Integration, Research Design, Laboratory automation, Robot, Software engineering, business, Closed loop, Algorithms, Biotechnology
Abstract: The automation of laboratory techniques has greatly increased the number of experiments that can be carried out in the chemical and biological sciences. Until recently, this automation has focused primarily on improving hardware. Here we argue that future advances will concentrate on intelligent software to integrate physical experimentation and results analysis with hypothesis formulation and experiment planning. To illustrate our thesis, we describe the 'Robot Scientist' - the first physically implemented example of such a closed loop system. In the Robot Scientist, experimentation is performed by a laboratory robot, hypotheses concerning the results are generated by machine learning and experiments are allocated and selected by a combination of techniques derived from artificial intelligence research. The performance of the Robot Scientist has been evaluated by a rediscovery task based on yeast functional genomics. The Robot Scientist is proof that the integration of programmable laboratory hardware and intelligent software can be used to develop increasingly automated laboratories.
Published: 2004
