Robin A. Richardson, Timothy J. Griffin, Hailiang Mei, Jon Ison, Pratik D. Jagtap, A. Amor, Ilkay Altintas, Christopher J. O. Baker, Magnus Palmblad, Szoke Szaniszlo, Tobias Kuhn, Carole Goble, Suzan Verberne, Anna-Lena Lamprecht, Paulos Charonyktakis, Hans Ienasescu, Salvador Capella-Gutierrez, Matúš Kalaš, Michael R. Crusoe, Aswin Verhoeven, Steffen Möller, Katherine Wolstencroft, Yolanda Gil, Vedran Kasalica, Hervé Ménager, Vincent Robert, Stian Soiland-Reyes, Alireza Khanteymoori, Paul Groth, Robert Stevens, Mohammad Sadnan Al Manir, Veit Schwämmle, Algorithmic Data Science (IVI, FNWI), Intelligent Data Engineering Lab (IvI, FNWI), Utrecht University [Utrecht], Leiden University Medical Center (LUMC), Universiteit Leiden, Institut Français de Bioinformatique (IFB-CORE), Institut National de Recherche en Informatique et en Automatique (Inria)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE), University of Southern Denmark (SDU), University of Virginia, University of California [San Diego] (UC San Diego), University of California (UC), University of New Brunswick (UNB), Westerdijk Fungal Biodiversity Institute [Utrecht] (WI), Royal Netherlands Academy of Arts and Sciences (KNAW), Barcelona Supercomputing Center - Centro Nacional de Supercomputacion (BSC - CNS), Gnosis Data Analysis PC, Vrije Universiteit Amsterdam [Amsterdam] (VU), University of Southern California (USC), University of Manchester [Manchester], School of Computer Science [Manchester], University of Minnesota System, University of Amsterdam [Amsterdam] (UvA), Danmarks Tekniske Universitet = Technical University of Denmark (DTU), University of Bergen (UiB), University of Freiburg [Freiburg], Institut Pasteur [Paris] (IP), University Medical Center Rostock, Netherlands 
eScience Center, Leiden Institute of Advanced Computer Science [Leiden] (LIACS), Computer Systems, Network Institute, Business Web and Media, Intelligent Information Systems, Westerdijk Fungal Biodiversity Institute - Software and Databasing. Stian Soiland-Reyes was supported by the BioExcel-2 Centre of Excellence, funded by the European Commission Horizon 2020 programme under contract H2020-INFRAEDI-02-2018 823830. Carole Goble was supported by EOSC-Life, funded by the European Commission Horizon 2020 programme under grant agreement H2020-INFRAEOSC-2018-2 824087. We gratefully acknowledge the financial support from the Lorentz Center, ELIXIR, and the Leiden University Medical Center (LUMC) that made the workshop possible.
Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the “big picture” of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. 
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.
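
The six life cycle stages listed above can be sketched as a small data structure. This is a minimal illustration only: the names `Stage`, `next_stage`, and `stages_between` are hypothetical, and the strictly linear, forward-only ordering is a simplifying assumption that the text does not state (real workflow development is likely to iterate between stages).

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """The six workflow life cycle stages, numbered as in the text."""
    QUESTION_OR_HYPOTHESIS = 1
    CONCEPTUAL_WORKFLOW = 2
    ABSTRACT_WORKFLOW = 3
    CONCRETE_WORKFLOW = 4
    PRODUCTION_WORKFLOW = 5
    SCIENTIFIC_RESULTS = 6


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one.

    Assumption: transitions are modelled as strictly forward and linear.
    """
    if stage is Stage.SCIENTIFIC_RESULTS:
        return None
    return Stage(stage + 1)


def stages_between(start: Stage, end: Stage) -> List[Stage]:
    """List every stage from `start` to `end`, inclusive, in order."""
    if start > end:
        raise ValueError("start must not come after end")
    return [Stage(value) for value in range(start, end + 1)]


if __name__ == "__main__":
    # Print the full life cycle, one stage per line.
    for stage in stages_between(Stage.QUESTION_OR_HYPOTHESIS,
                                Stage.SCIENTIFIC_RESULTS):
        print(stage.value, stage.name)
```

An ordered enumeration keeps the stage numbering from the text explicit. The tools and domain knowledge that drive each transition are deliberately left out, since the abstract does not specify them.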