379 results on '"Universität Leipzig"'
Search Results
2. Übersicht über die Promotionen an der Fakultät für Mathematik und Informatik der Universität Leipzig von 1998 bis 2000
- Author
-
Universität Leipzig
- Subjects
Mathematik ,ddc:000 ,Informatik, Datenverarbeitung ,ddc:500 - Published
- 2004
3. Übersicht über die Habilitationen an der Fakultät für Mathematik und Informatik der Universität Leipzig von 1993 bis 1997
- Author
-
Universität Leipzig
- Subjects
Mathematik ,ddc:000 ,Informatik, Datenverarbeitung ,ddc:500 - Published
- 1999
4. Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization
- Author
-
Yousef, Tariq and Universität Leipzig
- Subjects
Text Alignment, Digital Humanities, Natural Language Processing, Visualization, Machine Learning, Language Modelling, Annotation Tool ,ddc:000 - Abstract
Translation alignment is an essential task in Digital Humanities and Natural Language Processing: it aims to link words/phrases in the source text with their translation equivalents in the translation. In addition to its importance for teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistic annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool for manual alignment, with the aim of gathering training data for an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I later used for supervised training. Ugarit has been used by many researchers and scholars, including in the classroom at several institutions for teaching and learning ancient languages, which resulted in a large, diverse, crowd-sourced aligned parallel corpus. This corpus allowed us to conduct experiments and qualitative analyses to detect recurring patterns in annotators’ alignment practice and in the generated translation pairs. Further, I employed recent advances in NLP and language modeling to develop an automatic alignment model for low-resource historical languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. Then, I integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model. To ensure best practice, I reviewed the current evaluation procedure, defined its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold-standard datasets and to support quantitative and qualitative evaluation of translation alignment models. In addition, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks, employing the latest advances in information visualization and best practices. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods, and visual analytics tools that aim to advance the field of translation alignment for historical languages.
- Published
- 2023
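The thesis in entry 4 trains automatic word-alignment models on top of multilingual embeddings. As a rough, generic illustration of one common baseline for embedding-based alignment (not the author's Ugarit-trained model), the sketch below keeps mutual-argmax matches over a cosine-similarity matrix; the toy tokens and random vectors stand in for real contextual embeddings.

```python
import numpy as np

def cosine_matrix(src, tgt):
    """Pairwise cosine similarities between source and target word vectors."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T

def mutual_argmax_alignment(sim):
    """Keep (i, j) pairs where i and j are each other's best match."""
    best_for_src = sim.argmax(axis=1)          # best target word for each source word
    best_for_tgt = sim.argmax(axis=0)          # best source word for each target word
    return [(i, j) for i, j in enumerate(best_for_src) if best_for_tgt[j] == i]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source_words = ["menin", "aeide", "thea"]            # toy Ancient Greek tokens
    target_words = ["sing", "goddess", "the", "wrath"]   # toy English tokens
    # Random vectors stand in for multilingual contextual embeddings.
    src_vecs = rng.normal(size=(len(source_words), 32))
    tgt_vecs = rng.normal(size=(len(target_words), 32))
    sim = cosine_matrix(src_vecs, tgt_vecs)
    for i, j in mutual_argmax_alignment(sim):
        print(f"{source_words[i]} <-> {target_words[j]} (sim={sim[i, j]:.2f})")
```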
5. Two-Stage Vehicle Routing Problems with Profits and Buffers: Analysis and Metaheuristic Optimization Algorithms
- Author
-
Le, Hoang Thanh and Universität Leipzig
- Subjects
ddc:000 ,Vehicle Routing, optimization, metaheuristics, complexity, NP-hardness, local search ,Tourenplanung, Optimierung, Metaheuristik, Komplexität, NP-Schwere, lokale Suche - Abstract
This thesis considers the Two-Stage Vehicle Routing Problem (VRP) with Profits and Buffers, which generalizes various optimization problems that are relevant for practical applications, such as the Two-Machine Flow Shop with Buffers and the Orienteering Problem. Two optimization problems are considered for the Two-Stage VRP with Profits and Buffers, namely the minimization of total time while respecting a profit constraint and the maximization of total profit under a budget constraint. The former generalizes the makespan minimization problem for the Two-Machine Flow Shop with Buffers, whereas the latter is comparable to the problem of maximizing score in the Orienteering Problem. For the three problems, a theoretical analysis is performed regarding computational complexity, existence of optimal permutation schedules (where all vehicles traverse the same nodes in the same order) and potential gaps in attainable solution quality between permutation schedules and non-permutation schedules. The obtained theoretical results are visualized in a table that gives an overview of various subproblems belonging to the Two-Stage VRP with Profits and Buffers, their theoretical properties and how they are connected. For the Two-Machine Flow Shop with Buffers and the Orienteering Problem, two metaheuristics 2BF-ILS and VNSOP are presented that obtain favorable results in computational experiments when compared to other state-of-the-art algorithms. For the Two-Stage VRP with Profits and Buffers, an algorithmic framework for Iterative Search Algorithms with Variable Neighborhoods (ISAVaN) is proposed that generalizes aspects from 2BF-ILS as well as VNSOP. Various algorithms derived from that framework are evaluated in an experimental study. The evaluation methodology used for all computational experiments in this thesis takes the performance during the run time into account and demonstrates that algorithms for structurally different problems, which are encompassed by the Two-Stage VRP with Profits and Buffers, can be evaluated with similar methods. The results show that the most suitable choice for the components in these algorithms is dependent on the properties of the problem and the considered evaluation criteria. However, a number of similarities to algorithms that perform well for the Two-Machine Flow Shop with Buffers and the Orienteering Problem can be identified. 
The framework unifies these characteristics, providing a spectrum of algorithms that can be adapted to the specifics of the considered Vehicle Routing Problem.
Table of contents:
1 Introduction
2 Background 2.1 Problem Motivation 2.2 Formal Definition of the Two-Stage VRP with Profits and Buffers 2.3 Review of Literature on Related Vehicle Routing Problems 2.3.1 Two-Stage Vehicle Routing Problems 2.3.2 Vehicle Routing Problems with Profits 2.3.3 Vehicle Routing Problems with Capacity- or Resource-based Restrictions 2.4 Preliminary Remarks on Subsequent Chapters
3 The Two-Machine Flow Shop Problem with Buffers 3.1 Review of Literature on Flow Shop Problems with Buffers 3.1.1 Algorithms and Metaheuristics for Flow Shops with Buffers 3.1.2 Two-Machine Flow Shop Problems with Buffers 3.1.3 Blocking Flow Shops 3.1.4 Non-Permutation Schedules 3.1.5 Other Extensions and Variations of Flow Shop Problems 3.2 Theoretical Properties 3.2.1 Computational Complexity 3.2.2 The Existence of Optimal Permutation Schedules 3.2.3 The Gap Between Permutation Schedules and Non-Permutation Schedules 3.3 A Modification of the NEH Heuristic 3.4 An Iterated Local Search for the Two-Machine Flow Shop Problem with Buffers 3.5 Computational Evaluation 3.5.1 Algorithms for Comparison 3.5.2 Generation of Problem Instances 3.5.3 Parameter Values 3.5.4 Comparison of 2BF-ILS with other Metaheuristics 3.5.5 Comparison of 2BF-OPT with NEH 3.6 Summary
4 The Orienteering Problem 4.1 Review of Literature on Orienteering Problems 4.2 Theoretical Properties 4.3 A Variable Neighborhood Search for the Orienteering Problem 4.4 Computational Evaluation 4.4.1 Measurement of Algorithm Performance 4.4.2 Choice of Algorithms for Comparison 4.4.3 Problem Instances 4.4.4 Parameter Values 4.4.5 Experimental Setup 4.4.6 Comparison of VNSOP with other Metaheuristics 4.5 Summary
5 The Two-Stage Vehicle Routing Problem with Profits and Buffers 5.1 Theoretical Properties of the Two-Stage VRP with Profits and Buffers 5.1.1 Computational Complexity of the General Problem 5.1.2 Existence of Permutation Schedules in the Set of Optimal Solutions 5.1.3 The Gap Between Permutation Schedules and Non-Permutation Schedules 5.1.4 Remarks on Restricted Cases 5.1.5 Overview of Theoretical Results 5.2 A Metaheuristic Framework for the Two-Stage VRP with Profits and Buffers 5.3 Experimental Results 5.3.1 Problem Instances 5.3.2 Experimental Results for O_{max R, Cmax≤B} 5.3.3 Experimental Results for O_{min Cmax, R≥Q} 5.4 Summary
Bibliography, List of Figures, List of Tables, List of Algorithms
- Published
- 2022
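Entry 5 builds metaheuristics such as the iterated local search 2BF-ILS for permutation scheduling problems. The sketch below shows the generic iterated-local-search pattern on a plain two-machine flow shop makespan objective (no buffers, no profits); it is a minimal illustration under those simplifying assumptions, not the algorithm developed in the thesis.

```python
import random

def makespan(perm, p1, p2):
    """Makespan of a permutation schedule on a plain two-machine flow shop."""
    c1 = c2 = 0
    for j in perm:
        c1 += p1[j]                 # job j finishes on machine 1
        c2 = max(c2, c1) + p2[j]    # machine 2 starts once both are ready
    return c2

def local_search(perm, p1, p2):
    """First-improvement search over pairwise swaps until no swap helps."""
    best = makespan(perm, p1, p2)
    improved = True
    while improved:
        improved = False
        for i in range(len(perm) - 1):
            for j in range(i + 1, len(perm)):
                perm[i], perm[j] = perm[j], perm[i]
                cost = makespan(perm, p1, p2)
                if cost < best:
                    best, improved = cost, True
                else:
                    perm[i], perm[j] = perm[j], perm[i]  # undo the swap
    return perm, best

def iterated_local_search(p1, p2, iterations=50, seed=1):
    rng = random.Random(seed)
    perm = list(range(len(p1)))
    perm, best = local_search(perm, p1, p2)
    for _ in range(iterations):
        candidate = perm[:]
        i, j = sorted(rng.sample(range(len(candidate)), 2))
        candidate.insert(j, candidate.pop(i))       # perturbation: reinsert one job
        candidate, cost = local_search(candidate, p1, p2)
        if cost < best:                             # accept only improvements
            perm, best = candidate, cost
    return perm, best

if __name__ == "__main__":
    p1 = [4, 2, 7, 3, 5, 6]  # processing times on machine 1 (toy instance)
    p2 = [3, 6, 2, 5, 4, 1]  # processing times on machine 2
    schedule, best = iterated_local_search(p1, p2)
    print("best permutation:", schedule, "makespan:", best)
```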
6. SKIL 2012 - Dritte Studentenkonferenz Informatik Leipzig: Dritte Studentenkonferenz Informatik Leipzig 2012, Leipzig, Deutschland, 25. September 2012, Tagungsband
- Author
-
Schmidt, Johannes, Riechert, Thomas, Auer, Sören, Institut für Angewandte Informatik (InfAI) e. V. an der Universität Leipzig, Leipziger Informatik-Verbund (LIV) an der Universität Leipzig, and Universität Leipzig
- Subjects
Studentenkonferenz, Informatik, Studierende, Konferenz, Leipzig ,ddc:000 ,Student Conference Leipzig - Abstract
The student conference of the Institute of Computer Science at Universität Leipzig is aimed at all students of computer science and of neighbouring disciplines with a computer science focus. The conference aims to offer students a platform for presenting their projects and research plans. The meeting centres on student projects from seminars, internships, theses, and extracurricular activities. SKIL gives students the opportunity to present and discuss ideas, plans, and results in front of an academic audience. The organization of the conference differs only slightly from that of scientific conferences: contributions were submitted via a conference system, and all submissions were then reviewed by the programme committee. Accepted contributions were presented on the day of the conference and are published in these proceedings. The third student conference Informatik Leipzig 2012 took place on 25 September 2012 as part of SABRE, an international, integrative multi-conference on forward-looking technologies in software development, agent technologies, and service computing for business, development, and science. SKIL 2012 was organized at the Institut für Angewandte Informatik (InfAI) e.V. and was initiated and largely realized by the research groups Agile Knowledge Engineering and Semantic Web (AKSW) and Service Science and Technology (SeSaT) of Universität Leipzig.
- Published
- 2012
7. Entwicklungspfad Service Engineering 2.0: Neue Perspektiven für die Dienstleistungsentwicklung
- Author
-
Meyer, Kyrill, Böttcher, Martin, Institut für Angewandte Informatik (InfAI) e.V. an der Universität Leipzig, and Universität Leipzig
- Subjects
Service Engineering, Service Science, New Service Development ,Service Engineering, Dienstleistungsentwicklung ,Dienstleistung ,ddc:000 - Abstract
Over the last few years, service engineering has increasingly established itself as a scientific discipline. This volume offers a compact outline of the development of this field of research and condenses its basic ideas and the aspects considered in research so far. In addition, it examines new requirements arising from a changing service economy. These requirements imply that existing service engineering approaches must be fundamentally advanced and new approaches must be provided. Together they form the development paths towards a Service Engineering 2.0.
- Published
- 2012
8. Systemintegration: eine qualitative Erhebung aus der Sicht von Integrationsdienstleistern
- Author
-
Gebauer, Martin, Stefan, Fred, Universität Leipzig, and Institut für Angewandte Informatik (InfAI) e.V. an der Universität Leipzig
- Subjects
ddc:000 ,Systemsintegration, Integration Engineering, qualitative Study, Integration Services ,Systemintegration, Integration Engineering, qualitative Erhebung, Integrationsdienstleistung - Abstract
Because heterogeneous systems remain in use, system integration is of lasting importance for many companies and especially for integration service providers. Integration practice does not always match current research topics. This qualitative survey provides a status report on the challenges and problems that integration service providers face daily and serves to identify research questions of practical relevance.
- Published
- 2011
9. From RNA folding to inverse folding: a computational study: Folding and design of RNA molecules
- Author
-
Nono Saha, Cyrille Merleau and Universität Leipzig
- Subjects
RNA folding, RNA design, FFT, Evolutionary algorithm, Lévy flights, pseudoknots, RNA kinetics ,ddc:000 - Abstract
Since the discovery of the structure of DNA in 1953 and its double-chained complement of information hinting at its means of replication, biologists have recognized the strong connection between molecular structure and function. In the past two decades, there has been a surge of research on an ever-growing class of RNA molecules that are non-coding but whose various folded structures allow a diverse array of vital functions. Beyond the well-known splicing and modification of ribosomal RNA, non-coding RNAs (ncRNAs) are now known to be intimately involved in possibly every stage of DNA transcription and protein translation, as well as in RNA signalling and gene regulation processes. Despite the rapid development and declining cost of modern molecular methods, they typically can only describe ncRNA structural conformations in vitro, which differ from their in vivo counterparts. Moreover, it is estimated that only a tiny fraction of known ncRNAs has been documented experimentally, often at a high cost. There is thus a growing realization that computational methods must play a central role in the analysis of ncRNAs. Not only do computational approaches hold the promise of rapidly characterizing many ncRNAs yet to be described, but there is also the hope that by understanding the rules that determine their structure, we will gain better insight into their function and design. Many studies have revealed that ncRNA functions are performed by high-level structures that often depend on their low-level structures, such as the secondary structure. This thesis studies the computational folding mechanism and inverse folding of ncRNAs at the secondary-structure level. We describe the development of two bioinformatic tools that have the potential to improve our understanding of RNA secondary structure: (1) RAFFT, for efficient prediction of pseudoknot-free RNA folding pathways using the fast Fourier transform (FFT); (2) aRNAque, an evolutionary algorithm inspired by Lévy flights for RNA inverse folding with or without pseudoknots (a secondary structure that often poses difficulties for bio-computational detection). The first tool, RAFFT, implements a novel heuristic to predict RNA secondary structure formation pathways that has two components: (i) a folding algorithm and (ii) a kinetic ansatz. When considering the best prediction in the ensemble of 50 secondary structures predicted by RAFFT, its performance matches recent deep-learning-based structure prediction methods. RAFFT also acts as a folding kinetic ansatz, which we tested on two RNAs: the CFSE and a classic bi-stable sequence. In both test cases, fewer structures were required to reproduce the full kinetics, whereas known methods (such as Treekin) required a sample of 20,000 structures or more. The second tool, aRNAque, implements an evolutionary algorithm (EA) inspired by Lévy flights that allows both local and global search and supports pseudoknotted target structures. The number of point mutations at every step of aRNAque's EA is drawn from a Zipf distribution. Our proposed method therefore increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. Intensive benchmarks on both pseudoknotted and pseudoknot-free datasets showed improved empirical performance compared to existing tools. In conclusion, we highlight some promising extensions of the versatile RAFFT method to RNA-RNA interaction studies.
We also provide an outlook on both tools' implications in studying evolutionary dynamics.
- Published
- 2022
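aRNAque, described in entry 9, draws the number of point mutations per evolutionary step from a Zipf distribution, which yields Lévy-flight-like jumps between many small and occasional large changes. The sketch below illustrates only that mutation operator on a toy RNA sequence; the exponent and the sequence are invented, and the structure-based fitness evaluation of the real tool is omitted.

```python
import numpy as np

BASES = np.array(list("ACGU"))

def zipf_mutate(seq, exponent=1.5, rng=None):
    """Apply k point mutations, with k drawn from a heavy-tailed Zipf distribution."""
    rng = rng or np.random.default_rng()
    k = min(int(rng.zipf(exponent)), len(seq))       # occasional large jumps (Lévy-flight-like)
    positions = rng.choice(len(seq), size=k, replace=False)
    seq = np.array(list(seq))
    for pos in positions:
        choices = BASES[BASES != seq[pos]]           # force an actual base change
        seq[pos] = rng.choice(choices)
    return "".join(seq), k

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    sequence = "GCGCUUCGCCGCGAAGCGC"                  # toy sequence, not from the thesis
    for step in range(5):
        sequence, k = zipf_mutate(sequence, rng=rng)
        print(f"step {step}: {k:2d} mutation(s) -> {sequence}")
```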
10. Inferring developmental trajectories in gene and cell state space
- Author
-
Schmidt, Maria and Universität Leipzig
- Subjects
Trajectory, Pseudotime, Sample state space, Feature state space, OMICs ,ddc:000 - Abstract
A trajectory in physics is the path that an object covers in space and time. By analogy, trajectories in biology can be understood as paths that biological entities such as cells or organisms cover between developmental or evolutionary states in space and time. Trajectory inference methods have been developed to analyze molecular ‘snapshot’ data obtained from cross-sectional OMICs experiments and to deduce a proxy of temporal processes, like cellular development or the evolution of an organism, that is otherwise often not available. In this thesis, the concept of OMICs-based trajectories was addressed in the context of three topics: (i) Trajectories in single-cell or bulk transcriptomic state space to infer the development and differentiation of cells in a multicellular organism (planaria), as an example of differentiation from stem cells towards differentiated tissues, and the development of treatment resistance in melanoma cell lines under targeted treatment; further, we analyze the aging of the blood transcriptome in a large population cohort. (ii) Transferring the concept of trajectories from cell state space into gene state space to infer the switching of genomic programs during development. (iii) Transferring the concept of trajectories from transcriptome state space to genetic (mutational) state space to infer evolutionary paths, with applications to SARS-CoV-2 virus evolution over the two years 2020 and 2021, and to the dissemination of grapevine varieties over about 8,000 years, from the (presumably) first cultivation of the vine in the Caucasus region around the Mediterranean Sea and into Western Europe. The main aim of this thesis is to develop, adapt, and apply computational methods to reconstruct developmental trajectories across different OMICs domains in cell and in feature space. The latter, trajectories in feature state space, can be modeled using self-organizing map (SOM) machine learning. The method transforms multidimensional gene expression patterns into two-dimensional data landscapes resembling the metaphoric Waddington epigenetic landscape. Trajectories in this SOM landscape visualize the transcriptional programs passed through by cells along their line of development from undifferentiated to differentiated cells, and evolutionary paths over several generations of species or cultivars. As a proof-of-principle study, we applied our method to a single-cell transcriptomics data set describing tissue development in a regenerating flatworm. Using SOM machine learning, we identified cell-type- and tissue-specific modules of activated genes, as well as differentiation trajectories linking stemness-source and tissue-related sink modules. The methods developed in this thesis enable a novel, more comprehensive view of the dynamics in high-dimensional OMICs data landscapes. The variety of examples chosen illustrates the mutual dependencies as well as the particularities of each state and data type. Our analyses support the understanding of the dynamics of the data in the context of their developmental, functional, geographical, or historical stage.
By applying trajectory inference methods to cancer and SARS-CoV-2 data sets, we could confirm and supplement previous findings on the dynamics of important underlying transcriptional and genomic programs.
Table of contents: Abstract
1 Introduction 1.1 Developmental trajectories 1.2 Measuring gene and cell states: challenges in high-throughput data analysis 1.3 Single-cell RNA sequencing (scRNA-seq) data and trajectory inference 1.4 Trajectories in cell and gene states 1.5 Single-cell self-organizing map - portrayal 1.6 Objectives and outline 1.7 Biological background of OMICs terminologies 1.8 Bulk OMICs data 1.9 Single-cell OMICs
2 Bioinformatics of single-cell RNA sequencing 2.1 Processing steps 2.2 Single-cell SOM portrayal
3 Methods: Pseudotime trajectories in gene and cell state space 3.1 Trajectories in cell state space 3.2 Trajectories in gene state space 3.3 RNA Velocity in gene and cell state space 3.4 Software 3.5 Application in case studies 3.6 Worked example: Trajectories of planarian tissue development
4 Application I: Trajectories of melanoma treatment resistance 4.1 Results 4.2 Discussion 4.3 Conclusion
5 Application II: Trajectories of the aging blood 5.1 Results 5.2 Discussion 5.3 Conclusion
6 Application III: Trajectories of the evolving SARS-CoV-2 genome 6.1 Results 6.2 Discussion 6.3 Conclusion
7 Application IV: Trajectories of grapevine dissemination 7.1 Results 7.2 Discussion 7.3 Conclusion
8 General Discussion 8.1 Trajectory inference across OMICs modalities and timescales 8.2 Challenges of trajectory inference of bulk and genomic data 8.3 Trajectory inference in cell and gene state space
9 Conclusion
10 Supplement 10.1 Materials and methods 10.2 Results
Index of abbreviations, List of tables, List of figures, References, Curriculum vitae, List of publications, Conferences / summer schools, Selbstständigkeitserklärung
- Published
- 2022
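Entry 10 projects high-dimensional expression patterns onto a two-dimensional landscape with self-organizing maps (SOMs) before tracing trajectories on it. The minimal from-scratch SOM below is only meant to illustrate that projection step on toy data; grid size, decay schedule, and the synthetic 'early'/'late' samples are arbitrary assumptions, not the pipeline used in the thesis.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=500, lr0=0.5, sigma0=3.0, seed=0):
    """Train a tiny self-organizing map; returns a codebook of shape (rows, cols, features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    codebook = rng.normal(size=(rows, cols, data.shape[1]))
    yy, xx = np.mgrid[0:rows, 0:cols]
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                  # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.5      # shrinking neighbourhood radius
        x = data[rng.integers(len(data))]
        dist = np.linalg.norm(codebook - x, axis=2)
        r, c = np.unravel_index(dist.argmin(), dist.shape)   # best-matching unit
        h = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))
        codebook += lr * h[..., None] * (x - codebook)       # pull neighbourhood towards x
    return codebook

def project(data, codebook):
    """Map each sample to the grid coordinates of its best-matching unit."""
    flat = codebook.reshape(-1, codebook.shape[2])
    idx = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2).argmin(axis=1)
    return np.column_stack(np.unravel_index(idx, codebook.shape[:2]))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy 'expression' data: two noisy states standing in for undifferentiated vs. differentiated cells.
    early = rng.normal(0.0, 0.3, size=(50, 20))
    late = rng.normal(1.0, 0.3, size=(50, 20))
    data = np.vstack([early, late])
    som = train_som(data)
    coords = project(data, som)
    print("early samples map near", coords[:50].mean(axis=0).round(1))
    print("late samples map near ", coords[50:].mean(axis=0).round(1))
```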
11. Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth
- Author
-
Lozada Chávez, Irma and Universität Leipzig
- Subjects
ddc:000 ,genome, non-coding-DNA, multicellularity, introns, ncRNA - Abstract
Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary), which has convergently evolved only a few times in Eukarya: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. Thus, I investigated the correlation between genome and organismal complexities across 461 eukaryotes under a phylogenetically controlled framework. To that end, I introduce the first formal definitions and criteria to distinguish ‘unicellularity’, ‘simple’ (SM) and ‘complex’ (CM) multicellularity. Rather than using the limited available estimations of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycle and body plan development from the literature. I then investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats and intergenic regions) describing the coding and ncDNA complexities of the 461 genomes. To that end, I developed ‘GenomeContent’, a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R scripts coupled to parallel computing were created to calculate >260,000 phylogenetically controlled pairwise correlations. As previously reported, both repetitive and non-repetitive DNA are found to scale strongly and positively with genome size across most eukaryotic lineages. In contrast to previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not or only very weakly associated with changes in genome size. Our evolutionary correlations are robust to different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome size estimates, and randomly reduced datasets. I then investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic Principal Component Analyses. Our results endorse a genetic distinction between SM and CM in Archaeplastida and Metazoa, but not so clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron-richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators, rather than of novel protein regulators, can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel ‘ceRNA-motif pipeline’ for the prediction of “competing endogenous” ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNA motifs, MIM166, MIM171 and MIM159/319, which were found to be conserved across land plants and to be potentially involved in diverse developmental processes and stress responses.
Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.
Table of contents: Cover page, Abstract, Acknowledgements, Index
1. The structure of this thesis 1.1. Structure of this PhD dissertation 1.2. Publications of this PhD dissertation 1.3. Computational infrastructure and resources 1.4. Disclosure of financial support and information use 1.5. Acknowledgements 1.6. Author contributions and use of impersonal and personal pronouns
2. Biological background 2.1. The complexity of the eukaryotic genome 2.2. The problem of counting and defining “genes” in eukaryotes 2.3. The “function” concept for genes and “dark matter” 2.4. Increases of organismal complexity on Earth through multicellularity 2.5. Multicellularity is a “fitness transition” in individuality 2.6. The complexity of cell differentiation in multicellularity
3. Technical background 3.1. The Phylogenetic Comparative Method (PCM) 3.2. RNA secondary structure prediction 3.3. Some standards for genome and gene annotation
4. What is in a eukaryotic genome? GenomeContent provides a good answer 4.1. Background 4.2. Motivation: an interoperable tool for data retrieval of gene annotations 4.3. Methods 4.4. Results 4.5. Discussion
5. The evolutionary correlation between genome size and ncDNA 5.1. Background 5.2. Motivation: estimating the relationship between genome size and ncDNA 5.3. Methods 5.4. Results 5.5. Discussion
6. The relationship between non-coding DNA and Complex Multicellularity 6.1. Background 6.2. Motivation: How to define and measure complex multicellularity across eukaryotes? 6.3. Methods 6.4. Results 6.5. Discussion
7. The ceRNA motif pipeline: regulation of microRNAs by target mimics 7.1. Background 7.2. A revisited protocol for the computational analysis of Target Mimics 7.3. Motivation: a novel pipeline for ceRNA motif discovery 7.4. Methods 7.5. Results 7.6. Discussion
8. Conclusions and outlook 8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses 8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya 8.3. “Complex multicellularity” is a major evolutionary transition 8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth
9. Supplementary Data
Bibliography, Curriculum Scientiae, Selbständigkeitserklärung (declaration of authorship)
- Published
- 2022
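Entry 11 rests on phylogenetically controlled pairwise correlations between genome features. As a small illustration of the underlying idea (not the GenomeContent/R workflow of the thesis), the sketch below computes Felsenstein's independent contrasts on an invented four-species tree and correlates two toy traits through the origin; tree topology, branch lengths, and trait values are all made up.

```python
import math

# A tiny rooted binary tree written as nested tuples:
# an internal node is ((left_subtree, left_branch_length), (right_subtree, right_branch_length)),
# a leaf is just its name. Topology and branch lengths are invented for illustration.
TREE = (
    ((("human", 1.0), ("mouse", 1.0)), 1.0),
    ((("frog", 1.5), ("zebrafish", 1.5)), 0.5),
)

GENOME_SIZE = {"human": 3.1, "mouse": 2.7, "frog": 1.7, "zebrafish": 1.4}   # toy trait 1
INTRON_LOAD = {"human": 0.9, "mouse": 0.8, "frog": 0.5, "zebrafish": 0.4}   # toy trait 2

def contrasts(subtree, branch_length, trait, out):
    """Felsenstein's independent contrasts; returns (node value, effective branch length)."""
    if isinstance(subtree, str):                        # leaf
        return trait[subtree], branch_length
    (left, bl_left), (right, bl_right) = subtree
    x1, v1 = contrasts(left, bl_left, trait, out)
    x2, v2 = contrasts(right, bl_right, trait, out)
    out.append((x1 - x2) / math.sqrt(v1 + v2))          # standardized contrast
    node_value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    return node_value, branch_length + v1 * v2 / (v1 + v2)

def pic_correlation(tree, trait_a, trait_b):
    """Correlation of two traits computed on independent contrasts (through the origin)."""
    ca, cb = [], []
    contrasts(tree, 0.0, trait_a, ca)
    contrasts(tree, 0.0, trait_b, cb)
    num = sum(a * b for a, b in zip(ca, cb))
    den = math.sqrt(sum(a * a for a in ca) * sum(b * b for b in cb))
    return num / den

if __name__ == "__main__":
    print("phylogenetically controlled correlation:",
          round(pic_correlation(TREE, GENOME_SIZE, INTRON_LOAD), 3))
```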
12. Exploration of Chemical Space: Formal, chemical and historical aspects
- Author
-
Leal, Wilmer and Universität Leipzig
- Subjects
ddc:000 ,Chemical space, categorical chemistry, hypernetworks, complex systems, directed hypergraphs, discrete curvature - Abstract
Starting from the observation that substances and reactions are the central entities of chemistry, I have structured chemical knowledge into a formal space called a directed hypergraph, which arises when substances are connected by their reactions. I call this hypernet chemical space. In this thesis, I explore different levels of description of this space: its evolution over time, its curvature, and categorical models of its compositionality. The vast majority of the chemical literature focuses on investigations of particular aspects of some substances or reactions, which have been systematically recorded in comprehensive databases such as Reaxys for the last 200 years. While complexity science has made important advances in physics, biology, economics, and many other fields, it has somewhat neglected chemistry. In this work, I propose to take a global view of chemistry and to combine complexity science tools, modern data analysis techniques, and geometric and compositional theories to explore chemical space. This provides a novel view of chemistry, its history, and its current status. We argue that a large directed hypergraph, that is, a model of directed relations between sets, underlies chemical space and that a systematic study of this structure is a major challenge for chemistry. Using the Reaxys database as a proxy for chemical space, we search for large-scale changes in a directed hypergraph model of chemical knowledge and present a data-driven approach to navigate through its history and evolution. These investigations focus on the mechanistic features by which this space has been expanding: the role of synthesis and extraction in the production of new substances, patterns in the selection of starting materials, and the frequency with which reactions reach new regions of chemical space. Large-scale patterns that emerged in the last two centuries of chemical history are detected, in particular, in the growth of chemical knowledge, the use of reagents, and the synthesis of products, which reveal both conservatism and sharp transitions in the exploration of the space. Furthermore, since chemical similarity of substances arises from affinity patterns in chemical reactions, we quantify the impact of changes in the diversity of the space on the formulation of the system of chemical elements. In addition, we develop formal tools to probe the local geometry of the resulting directed hypergraph and introduce the Forman-Ricci curvature for directed and undirected hypergraphs. This notion of curvature is characterized by applying it to social and chemical networks with higher order interactions, and then used for the investigation of the structure and dynamics of chemical space. The network model of chemistry is strongly motivated by the observation that the compositional nature of chemical reactions must be captured in order to build a model of chemical reasoning. A step forward towards categorical chemistry, that is, a formalization of all the flavors of compositionality in chemistry, is taken by the construction of a categorical model of directed hypergraphs. We lifted the structure from a lineale (a poset version of a symmetric monoidal closed category) to a category of Petri nets, whose wiring is a bipartite directed graph equivalent to a directed hypergraph. 
The resulting construction, based on the Dialectica categories introduced by Valeria De Paiva, is a symmetric monoidal closed category with finite products and coproducts, which provides a formal way of composing smaller networks into larger ones in such a way that the algebraic properties of the components are preserved in the resulting network. Several sets of labels, often used in empirical data modeling, can be given the structure of a lineale, including stoichiometric coefficients in chemical reaction networks, reaction rates, inhibitor arcs, Boolean interactions, unknown or incomplete data, and probabilities. Therefore, a wide range of empirical data types for chemical substances and reactions can be included in our model.
- Published
- 2022
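Entry 12 models chemical space as a directed hypergraph in which substances are vertices and every reaction is a hyperedge from its educts to its products. The toy class below sketches that data structure together with a naive 'expansion' query (which substances become reachable from a starting set); it is a reading aid under invented reactions, not the Reaxys-scale machinery, curvature analysis, or categorical constructions of the thesis.

```python
from collections import defaultdict

class ChemicalSpace:
    """Directed hypergraph: each reaction is a hyperedge from a set of educts to a set of products."""

    def __init__(self):
        self.reactions = []                       # list of (educts, products) frozensets

    def add_reaction(self, educts, products):
        self.reactions.append((frozenset(educts), frozenset(products)))

    def degrees(self):
        """How often each substance appears as an educt (out) and as a product (in)."""
        out_deg, in_deg = defaultdict(int), defaultdict(int)
        for educts, products in self.reactions:
            for s in educts:
                out_deg[s] += 1
            for s in products:
                in_deg[s] += 1
        return dict(out_deg), dict(in_deg)

    def reachable(self, start):
        """Substances producible from 'start' by repeatedly firing reactions whose educts are available."""
        known = set(start)
        changed = True
        while changed:
            changed = False
            for educts, products in self.reactions:
                if educts <= known and not products <= known:
                    known |= products
                    changed = True
        return known

if __name__ == "__main__":
    space = ChemicalSpace()
    space.add_reaction({"A", "B"}, {"C"})          # toy reactions, not real chemistry
    space.add_reaction({"C"}, {"D", "E"})
    space.add_reaction({"E", "F"}, {"G"})
    print("reachable from {A, B}:", sorted(space.reachable({"A", "B"})))
    out_deg, in_deg = space.degrees()
    print("out-degrees:", out_deg)
```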
13. Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics
- Author
-
Nunn, Adam and Universität Leipzig
- Subjects
ddc:000 ,DNA methylation, Epigenetics, Genome assembly, Benchmarking, Pipeline - Abstract
The aim of this thesis is to bridge the gap between state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs to the same lineage of the Brassicaceae as the closely related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants. To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly follows a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies. Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres on the study of DNA methylation, and in particular on the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution. In order to maintain a broad level of comparison between T. arvense and the other selected species under the same initiative, a suite of software pipelines has also been developed, covering mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or the Genome Analysis ToolKit (GATK), which until now was feasible only with specifically adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries alongside. Each of these aspects is thoroughly benchmarked, integrated into a robust workflow management system, and adheres to the FAIR principles (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses. In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data.
The complete “Epidiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.
Table of contents: ABSTRACT, ACKNOWLEDGEMENTS
1 INTRODUCTION 1.1 ABOUT THIS WORK 1.2 BIOLOGICAL BACKGROUND 1.2.1 Epigenetics in plant ecology 1.2.2 DNA methylation 1.2.3 Maintenance of 5mC patterns in plants 1.2.4 Distribution of 5mC patterns in plants 1.3 TECHNICAL BACKGROUND 1.3.1 DNA sequencing 1.3.2 The case for a high-quality genome assembly 1.3.3 Sequence alignment for NGS 1.3.4 Variant calling approaches
2 BUILDING A SUITABLE REFERENCE GENOME 2.1 INTRODUCTION 2.2 MATERIALS AND METHODS 2.2.1 Seeds for the reference genome development 2.2.2 Sample collection, library preparation, and DNA sequencing 2.2.3 Contig assembly and initial scaffolding 2.2.4 Re-scaffolding 2.2.5 Comparative genomics 2.3 RESULTS 2.3.1 An improved reference genome sequence 2.3.2 Comparative genomics 2.4 DISCUSSION
3 FEATURE ANNOTATION FOR EPIGENOMICS 3.1 INTRODUCTION 3.2 MATERIALS AND METHODS 3.2.1 Tissue preparation for RNA sequencing 3.2.2 RNA extraction and sequencing 3.2.3 Transcriptome assembly 3.2.4 Genome annotation 3.2.5 Transposable element annotations 3.2.6 Small RNA annotations 3.2.7 Expression atlas 3.2.8 DNA methylation 3.3 RESULTS 3.3.1 Transcriptome assembly 3.3.2 Protein-coding genes 3.3.3 Non-coding loci 3.3.4 Transposable elements 3.3.5 Small RNA 3.3.6 Pseudogenes 3.3.7 Gene expression atlas 3.3.8 DNA Methylation 3.4 DISCUSSION
4 BISULFITE SEQUENCING METHODS 4.1 INTRODUCTION 4.2 PRINCIPLES OF BISULFITE SEQUENCING 4.3 EXPERIMENTAL DESIGN 4.4 LIBRARY PREPARATION 4.4.1 Whole Genome Bisulfite Sequencing (WGBS) 4.4.2 Reduced Representation Bisulfite Sequencing (RRBS) 4.4.3 Target capture bisulfite sequencing 4.5 BIOINFORMATIC ANALYSIS OF BISULFITE DATA 4.5.1 Quality Control 4.5.2 Read Alignment 4.5.3 Methylation Calling 4.6 ALTERNATIVE METHODS
5 FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS 5.1 INTRODUCTION 5.2 MATERIALS AND METHODS 5.2.1 Reference species 5.2.2 Natural accessions 5.2.3 Read simulation 5.2.4 Read alignment 5.2.5 Mapping rates 5.2.6 Precision-recall 5.2.7 Coverage deviation 5.2.8 DNA methylation analysis 5.3 RESULTS 5.4 DISCUSSION 5.5 A PIPELINE FOR WGBS ANALYSIS
6 THERE AND BACK AGAIN: INFERRING GENOMIC INFORMATION 6.1 INTRODUCTION 6.1.1 Implementing a new approach 6.2 MATERIALS AND METHODS 6.2.1 Validation datasets 6.2.2 Read processing and alignment 6.2.3 Variant calling 6.2.4 Benchmarking 6.3 RESULTS 6.4 DISCUSSION 6.5 A PIPELINE FOR SNP VARIANT ANALYSIS
7 POPULATION-LEVEL EPIGENOMICS 7.1 INTRODUCTION 7.2 CHALLENGES IN POPULATION-LEVEL EPIGENOMICS 7.3 DIFFERENTIAL METHYLATION 7.3.1 A pipeline for case/control DMRs 7.3.2 A pipeline for population-level DMRs 7.4 EPIGENOME-WIDE ASSOCIATION STUDIES (EWAS) 7.4.1 A pipeline for EWAS analysis 7.5 GENOTYPING-BY-SEQUENCING (EPIGBS) 7.5.1 Extending the epiGBS pipeline 7.6 POPULATION-LEVEL HAPLOTYPES 7.6.1 Extending the EpiDiverse/SNP pipeline
8 CONCLUSION
APPENDICES A. SUPPLEMENT: BUILDING A SUITABLE REFERENCE GENOME B. SUPPLEMENT: FEATURE ANNOTATION FOR EPIGENOMICS C. SUPPLEMENT: FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS D. SUPPLEMENT: INFERRING GENOMIC INFORMATION
BIBLIOGRAPHY
- Published
- 2022
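Entry 13 mentions a novel algorithm that makes conventional variant callers such as FreeBayes or GATK usable on bisulfite sequencing data. A common idea in this area is to mask mismatches that could merely be bisulfite conversions (C→T on reads from the original top strand, G→A on the bottom strand) before calling; the sketch below shows that masking idea on toy strings and is an illustration of the concept, not the thesis's actual algorithm.

```python
def mask_bisulfite(read, reference, original_top_strand=True):
    """Mask read positions where a mismatch is explainable by bisulfite conversion.

    On reads from the original top strand, an unmethylated C is sequenced as T,
    so a C(ref)->T(read) mismatch is masked; on the bottom strand the ambiguous
    change appears as G(ref)->A(read). Masked bases are replaced by 'N' so a
    downstream variant caller ignores them.
    """
    convertible = ("C", "T") if original_top_strand else ("G", "A")
    masked = []
    for ref_base, read_base in zip(reference, read):
        if ref_base == convertible[0] and read_base == convertible[1]:
            masked.append("N")              # could be a conversion, not a real SNP
        else:
            masked.append(read_base)
    return "".join(masked)

if __name__ == "__main__":
    reference = "ACGTTCGACGTA"
    read      = "ACGTTTGACGTG"   # T at index 5 (possible conversion) and G at index 11 (kept as variant)
    print(mask_bisulfite(read, reference))  # -> ACGTTNGACGTG
```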
14. Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology
- Author
-
Krämer, Stefan and Universität Leipzig
- Subjects
ddc:000 ,Toxicogenomics, Machine Learning, Chemical Perturbation, Data Integration, Computational Toxicology - Abstract
Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and to assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology considering systems biology and data integration. Still, the complexity of environmental and biological systems captured in such data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level, considering sources of complex environmental data. The first study employed data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses in their suitability to disentangle mixture effects of chemical exposures on biological effects and in their reliability in attributing potentially adverse outcomes to chemical drivers with toxicological databases on gene and pathway levels. Differential gene expression analysis and a network inference approach resulted in toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples considering mixtures of compounds at low concentrations. The applied approaches allowed the hazard of chemicals to be assessed more systematically with correlation-based compound groups. This dissertation presents another achievement toward data-driven hypothesis generation for molecular exposure effects. The approach combined text mining and deep learning. The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word-embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% for unseen test data of the employed knowledge base. However, we could not reliably confirm known chemical-gene interactions across the selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources such as literature, databases or omics-based exposure studies. Thus, the deep learning models might allow the prediction of hypotheses about exposure-related molecular effects. Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.
Table of contents: Abstract, Acknowledgements, Prelude
1 Introduction 1.1 An overview of environmental toxicology 1.1.1 Environmental toxicology 1.1.2 Chemicals in the environment 1.1.3 Systems biological perspectives in environmental toxicology 1.2 Computational toxicology 1.2.1 Omics-based approaches 1.2.2 Linking chemical exposure to transcriptional effects 1.2.3 Up-scaling from the gene level to higher biological organisation levels 1.2.4 Biomedical literature-based discovery 1.2.5 Deep learning with knowledge representation 1.3 Research question and approaches
2 Methods and Data 2.1 Linking environmentally relevant mixture exposures to transcriptional effects 2.1.1 Exposure and microarray data 2.1.2 Preprocessing 2.1.3 Differential gene expression 2.1.4 Association rule mining 2.1.5 Weighted gene correlation network analysis 2.1.6 Method comparison 2.2 Predicting exposure-related effects on a molecular level 2.2.1 Input 2.2.2 Input preparation 2.2.3 Deep learning models 2.2.4 Toxicogenomic application
3 Method comparison to link complex stream water exposures to effects on the transcriptional level 3.1 Background and motivation 3.1.1 Workflow 3.2 Results 3.2.1 Data preprocessing 3.2.2 Differential gene expression analysis 3.2.3 Association rule mining 3.2.4 Network inference 3.2.5 Method comparison 3.2.6 Application case of method integration 3.3 Discussion 3.4 Conclusion
4 Deep learning prediction of chemical-biomolecule interactions 4.1 Motivation 4.1.1 Workflow 4.2 Results 4.2.1 Input preparation 4.2.2 Model selection 4.2.3 Model comparison 4.2.4 Toxicogenomic application 4.2.5 Horizontal augmentation without tail-padding 4.2.6 Four-class problem formulation 4.2.7 Training with CTD data 4.3 Discussion 4.3.1 Transferring biomedical knowledge towards toxicology 4.3.2 Deep learning with biomedical knowledge representation 4.3.3 Data integration 4.4 Conclusion
5 Conclusion and Future perspectives 5.1 Conclusion 5.1.1 Investigating complex mixtures in the environment 5.1.2 Complex knowledge from literature and curated databases predict chemical-biomolecule interactions 5.1.3 Linking chemical exposure to biological effects by integrating CTD 5.2 Future perspectives
S1 Supplement Chapter 1 S1.1 Example of an estrogen bioassay S1.2 Types of mode of action S1.3 The dogma of molecular biology S1.4 Transcriptomics
S2 Supplement Chapter 3
S3 Supplement Chapter 4 S3.1 Hyperparameter tuning results S3.2 Functional enrichment with predicted chemical-gene interactions and CTD reference pathway genesets S3.3 Reduction of learning rate in a model with large word embedding vectors S3.4 Horizontal augmentation without tail-padding S3.5 Four-relationship classification S3.6 Interpreting loss observations for SemMedDB trained models
List of Abbreviations, List of Figures, List of Tables, Bibliography, Curriculum scientiae, Selbständigkeitserklärung
- Published
- 2022
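Entry 14 implements a word-embedding network followed by a feed-forward network to classify chemical-biomolecule relationships described in text. The PyTorch sketch below reproduces only that general architecture on a three-sentence toy corpus; the vocabulary, dimensions, labels, and training settings are invented, whereas the thesis worked with literature-scale relational data, data augmentation, and recurrent variants.

```python
import torch
from torch import nn

# Toy vocabulary and sentences describing chemical-gene pairs; label 1 = "interaction".
VOCAB = {w: i for i, w in enumerate(
    ["<pad>", "chemical", "gene", "increases", "decreases", "expression", "binds", "unrelated", "to"])}
SENTENCES = [
    ("chemical increases expression gene", 1),
    ("chemical binds gene", 1),
    ("chemical unrelated to gene", 0),
]

def encode(text, length=6):
    ids = [VOCAB[w] for w in text.split()]
    return ids + [VOCAB["<pad>"]] * (length - len(ids))

class RelationClassifier(nn.Module):
    """Word embeddings, mean pooling, then a small feed-forward classifier."""
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.ff = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)   # average word vectors per sentence
        return self.ff(pooled)

if __name__ == "__main__":
    x = torch.tensor([encode(s) for s, _ in SENTENCES])
    y = torch.tensor([label for _, label in SENTENCES])
    model = RelationClassifier(len(VOCAB))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(100):                             # tiny training loop on toy data
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print("predictions:", model(x).argmax(dim=1).tolist(), "labels:", y.tolist())
```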
15. Workflows for the Large-Scale Assessment of miRNA Evolution: Birth and Death of miRNA Genes in Tunicates
- Author
-
Velandia Huerto, Cristian Arley and Universität Leipzig
- Subjects
ddc:000 ,miRNA, homology, tunicates, evolution, synteny - Abstract
As described over 20 years ago with the discovery of RNA interference (RNAi), double-stranded RNAs occupy key roles in regulation and as a line of defense in animal cells. This thesis focuses on metazoan microRNAs (miRNAs). These small non-coding RNAs are distinguished from their small interfering RNA (siRNA) relatives by their tightly controlled, efficient and flexible biogenesis, together with a broader flexibility to target multiple mRNAs through imperfect seed base-pairing. As potent regulators, miRNAs are involved in mRNA stability and post-transcriptional regulation tasks, a conserved mechanism used repeatedly by evolution, not only in metazoans but also in plants and unicellular organisms. A comprehensive review of the current animal miRNA model shows that the canonical pathway dominates the extensive literature on miRNAs and has served as a scaffold for understanding the regulatory landscape orchestrated by the cell. The characterization of a diverse set of non-canonical pathways has expanded this view, suggesting a diverse, rich and flexible regulatory landscape for generating mature miRNAs. The production of miRNAs, derived from isolated or clustered transcripts, is an efficient and highly conserved mechanism that can be traced back across animals with high fidelity at the family level. In evolutionary terms, expansions of miRNA families have been associated with increasing morphological and developmental complexity. In particular, the Chordata clade (the ancient cephalochordates, the highly derived and secondarily simplified tunicates, and the well-known vertebrates) represents an interesting scenario in which to study miRNA evolution. Despite clearly conserved miRNAs along these clades, tunicates display massive restructuring events, including the emergence of highly derived miRNAs. As shown in this thesis, current animal miRNA annotations are biased towards model organisms and vertebrates, misrepresenting more diverse groups such as marine invertebrates. Current miRNA databases, such as miRBase and Rfam, classify miRNAs under different definitions and hold annotations that are not straightforward to link. As an alternative, this thesis proposes a method to curate and merge those annotations, making use of miRBase precursor/mature annotations and genomes together with Rfam predicted sequences. This approach generated structural models for shared miRNA families, based on the alignment of their correctly positioned mature sequences as anchors. In this process, the developed structural curation steps flagged 33 miRNA families from Rfam as questionable. Curated Rfam and miRBase anchored structural alignments provide a rich resource for constructing predictive miRNA profiles, using corresponding hidden Markov models (HMMs) and covariance models (CMs). In direct application, the use of those models is time-consuming, and the user has to deal with multiple iterations to achieve a genome-wide non-overlapping annotation. To resolve this, the proposed miRNAture pipeline provides an automatic and flexible solution for annotating miRNAs. It combines multiple homology approaches to generate the best candidates, validated at the sequence and structural levels. This increases the achievable sensitivity for annotating canonical miRNAs, and the evaluation against the human annotation shows that clear false positive calls are rare and that additional counterparts lie in retained introns, transcribed lncRNAs or repeat families.
Further development of miRNAture suggests the inclusion of additional rules to distinguish non-canonical miRNA families. This thesis also describes multiple homology approaches used to annotate the genomic information of a non-model chordate: the colonial tunicate Didemnum vexillum. High levels of genetic variance and unexpected levels of DNA degradation were evidenced through a comprehensive analysis of genome-assembly methods and gene annotation. Despite those challenges, it was possible to find candidate homeobox and skeletogenesis-related genes. The ncRNA annotation included the expected conserved families, and an extensive search for the Rhabdomyosarcoma 2-associated transcript (RMST) lncRNA family traced it back to the divergence of deuterostomes. In addition, a complete study of the annotation thresholds suggested variations for detecting miRNAs, later implemented in the miRNAture tool. This chapter is a showcase of the workflow that a comprehensive sequencing, assembly and annotation project should follow, in light of the growing body of research built on DNA sequencing. In the last 10 years, the remarkable increase in tunicate sequencing projects has opened access to an expanded miRNA annotation landscape. Accordingly, a comprehensive homology approach annotated the miRNA complement of 28 deuterostome genomes (including the 16 currently reported tunicates) using miRNAture. To obtain proper structural models as input, corrected miRBase structural alignments served as scaffolds for building the corresponding CMs, based on a developed genetic algorithm. By this means, this automatic approach selected the sets of sequences that compose the alignments, generating 2492 miRNA CMs. Despite the multiple sources and associated heterogeneity of the studied genomes, a clustering approach successfully gathered five groups of similar assemblies and highlighted low-quality assemblies. The overall reduction of families and loci in tunicates is striking: they show on average 374 miRNA loci, in comparison to other clades: Cephalochordata (2119), Vertebrata (3638), Hemichordata (1092) and Echinodermata (2737). The detection of 533 miRNA families at the divergence of tunicates reveals an expanded landscape compared with currently annotated miRNA families. Shared sets of ancestral, chordate-specific, Olfactores-specific, and clade-specific miRNAs were uncovered using phylogenetic conservation criteria. Compared to current annotations, the family repertoires were expanded in all cases. Finally, relying on the elements adjacent to annotated miRNAs, this thesis proposes additional syntenic support for clustering miRNA loci. In this way, the structural alignment of miR-1497, originally annotated in three model tunicates, was expanded with clear syntenic support in tunicates.
- Published
- 2022
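miRNAture, presented in entry 15, combines several homology searches and reduces their candidates to a non-overlapping, genome-wide miRNA annotation. The sketch below shows one simplified way to perform such a reduction, greedily keeping the best-scoring candidate per overlapping cluster; the coordinates, scores, and tool names in the toy data are placeholders, and the real pipeline adds sequence- and structure-level validation.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """A candidate miRNA locus reported by one homology search."""
    chrom: str
    start: int
    end: int
    family: str
    score: float
    source: str            # e.g. "blast", "cmsearch", "hmmsearch" (placeholders)

def best_nonoverlapping(hits):
    """Greedy selection: sort by score and keep hits that do not overlap an accepted one."""
    accepted = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        overlaps = any(
            hit.chrom == a.chrom and hit.start <= a.end and a.start <= hit.end
            for a in accepted
        )
        if not overlaps:
            accepted.append(hit)
    return sorted(accepted, key=lambda h: (h.chrom, h.start))

if __name__ == "__main__":
    candidates = [
        Hit("chr1", 100, 180, "mir-1497", 55.0, "cmsearch"),
        Hit("chr1", 120, 190, "mir-1497", 40.0, "blast"),      # overlaps the better hit above
        Hit("chr1", 500, 570, "mir-124", 62.5, "hmmsearch"),
        Hit("chr2", 300, 360, "let-7", 48.0, "cmsearch"),
    ]
    for hit in best_nonoverlapping(candidates):
        print(hit)
```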
16. Voice and silence in public debate: Modelling and observing collective opinion expression online
- Author
-
Gaisbauer, Felix and Universität Leipzig
- Subjects
ddc:000 ,Social networks, complex systems, computational social science - Abstract
This thesis investigates how group-level differences in willingness of opinion expression shape the extent to which certain standpoints are visible in public debate online. Against the backdrop of facilitated communication and connection to like-minded others through digital technologies, models and methods are developed and case studies are carried out – by and large from a network perspective. To this end, we first propose a model of opinion dynamics that examines social-structural conditions for public opinion expression or even predominance of different groups. The model focuses not on opinion change, but on the decision of individuals whether to express their opinion publicly or not. Groups of agents with different, fixed opinions interact with each other, changing the willingness to express their opinion according to the feedback they receive from others. We formulate the model as a multi-group game, and subsequently provide a dynamical systems perspective by introducing reinforcement learning dynamics. We show that a minority can dominate public discourse if its internal connections are sufficiently dense. Moreover, increased costs for opinion expression can drive even internally well-connected groups into silence. We then focus on how interaction networks can be used to infer political and social positions. For this purpose, we develop a new type of force-directed network layout algorithm. While being widely used, a rigorous interpretation of the outcomes of existing force-directed algorithms has not been provided yet. We argue that interpretability can be delivered by latent space approaches, which have the goal of embedding a network in an underlying social space. On the basis of such a latent space model, we derive a force-directed layout algorithm that can not only be used for the spatialisation of generic network data – exemplified by Twitter follower and retweet networks, as well as Facebook friendship networks – but also for the visualization of surveys. Comparison to existing layout algorithms (which are not grounded in an interpretable model) reveals that node groups are placed in similar configurations, while said algorithms show a stronger intra-cluster separation of nodes, as well as a tendency to separate clusters more strongly in retweet networks. In two case studies, we observe actual public debate on the social media platform Twitter – topics are the Saxon state elections 2019, and violent riots in the city of Leipzig on New Year’s Eve in the same year. We show that through the interplay of retweet and reply networks, it is possible to identify differences in willingness of opinion expression on the platform between opinion groups. We find that for both events, propensities to get involved in debate are asymmetric. Users retweeting far-right parties and politicians are significantly more active, making their positions disproportionately visible. Said users also act significantly more confrontational in the sense that they reply mostly to users from different groups, while the contrary is not the case. The findings underline that naive reliance on what others express online can be collectively dangerous, especially in an era in which social media shapes public discourse to an unprecedented extent.
- Published
- 2021
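Entry 16 studies agents who adjust their willingness to express a fixed opinion according to supportive or opposing feedback, with a minority potentially dominating if it is internally well connected. The simulation below is a deliberately crude toy version of that feedback mechanism (well-mixed contacts with a homophily knob instead of the game-theoretic, network-based formulation of the thesis); all parameter values are arbitrary.

```python
import random

def simulate(n_majority=80, n_minority=20, minority_homophily=0.9,
             steps=2000, lr=0.05, cost=0.25, seed=3):
    """Toy dynamics of willingness to express an opinion in two groups.

    Each agent keeps a probability of speaking up. When it speaks, it samples a
    few contacts (the minority samples mostly in-group contacts if homophily is
    high), raises its willingness if supportive voices outweigh opposing ones
    plus a fixed expression cost, and lowers it otherwise.
    """
    rng = random.Random(seed)
    groups = ["maj"] * n_majority + ["min"] * n_minority
    willingness = [0.5] * len(groups)

    for _ in range(steps):
        i = rng.randrange(len(groups))
        if rng.random() > willingness[i]:
            continue                                  # agent stays silent this round
        support = opposition = 0
        for _ in range(5):                            # sample a handful of contacts
            if groups[i] == "min" and rng.random() < minority_homophily:
                j = rng.randrange(n_majority, len(groups))   # in-group contact
            else:
                j = rng.randrange(len(groups))
            if j == i or rng.random() > willingness[j]:
                continue                              # contact stays silent: no feedback
            if groups[j] == groups[i]:
                support += 1
            else:
                opposition += 1
        feedback = support - opposition - cost
        willingness[i] = min(1.0, max(0.0, willingness[i] + lr * feedback))

    maj = sum(w for w, g in zip(willingness, groups) if g == "maj") / n_majority
    mino = sum(w for w, g in zip(willingness, groups) if g == "min") / n_minority
    return maj, mino

if __name__ == "__main__":
    maj, mino = simulate()
    print(f"mean willingness to express: majority {maj:.2f}, minority {mino:.2f}")
```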
17. Towards individualized transcranial electric stimulation therapy through computer simulation
- Author
-
Kalloch, Benjamin, Sehm, Bernhard, Hlawitschka, Mario, Scheuermann, Gerik, Wolters, Carsten, and Universität Leipzig
- Subjects
Transkranielle Elektrohirnstimulation, Simulation, Magnetresonanztomographie Bildverarbeitung, Schlaganfall, Unsicherheitsanalyse ,ddc:000 ,Transcranial electric brain stimulation, simulation, magnetic resonance image processing, stroke, uncertainty analysis - Abstract
Transkranielle Elektrostimulation (tES) beschreibt eine Gruppe von Hirnstimulationstechniken, die einen schwachen elektrischen Strom über zwei nicht-invasiv am Kopf angebrachten Elektroden applizieren. Handelt es sich dabei um einen Gleichstrom, spricht man von transkranieller Gleichstromstimulation, auch tDCS abgekürzt. Die allgemeine Zielstellung aller Hirnstimulationstechniken ist Hirnfunktion durch ein Verstärken oder Dämpfen von Hirnaktivität zu beeinflussen. Unter den Stimulationstechniken wird die transkranielle Gleichstromstimulation als ein adjuvantes Werkzeug zur Unterstützung der mikroskopischen Reorganisation des Gehirnes in Folge von Lernprozessen und besonders der Rehabilitationstherapie nach einem Schlaganfall untersucht. Aktuelle Herausforderungen dieser Forschung sind eine hohe Variabilität im erreichten Stimulationseffekt zwischen den Probanden sowie ein unvollständiges Verständnis des Zusammenspiels der der Stimulation zugrundeliegenden Mechanismen. Als Schlüsselkomponente für das Verständnis der Stimulationsmechanismen wird das zwischen den Elektroden im Kopf des Probanden aufgebaute elektrische Feld erachtet. Einem grundlegenden Konzept folgend wird angenommen, dass Hirnareale, die einer größeren elektrischen Feldstärke ausgesetzt sind, ebenso einen höheren Stimulationseffekt erfahren. Damit kommt der Positionierung der Elektroden eine entscheidende Rolle für die Stimulation zu. Allerdings verteilt sich das elektrische Feld wegen des heterogenen elektrischen Leitfähigkeitsprofil des menschlichen Kopfes nicht uniform im Gehirn der Probanden. Außerdem ist das Verteilungsmuster auf Grund anatomischer Unterschiede zwischen den Probanden verschieden. Die triviale Abschätzung der Ausbreitung des elektrischen Feldes anhand der bloßen Position der Stimulationselektroden ist daher nicht ausreichend genau für eine zielgerichtete Stimulation. Computerbasierte, biophysikalische Simulationen der transkraniellen Elektrostimulation ermöglichen die individuelle Approximation des Verteilungsmusters des elektrischen Feldes in Probanden basierend auf deren medizinischen Bildgebungsdaten. Sie werden daher zunehmend verwendet, um tDCS-Anwendungen zu planen und verifizieren, und stellen ein wesentliches Hilfswerkzeug auf dem Weg zu individualisierter Schlaganfall-Rehabilitationstherapie dar. Softwaresysteme, die den dahinterstehenden individualisierten Verarbeitungsprozess erleichtern und für ein breites Feld an Forschern zugänglich machen, wurden in den vergangenen Jahren für den Anwendungsfall in gesunden Erwachsenen entwickelt. Jedoch bleibt die Simulation von Patienten mit krankhaftem Hirngewebe und strukturzerstörenden Läsionen eine nicht-triviale Aufgabe. Daher befasst sich das hier vorgestellte Projekt mit dem Aufbau und der praktischen Anwendung eines Arbeitsablaufes zur Simulation transkranieller Elektrostimulation. Dabei stand die Anforderung im Vordergrund medizinische Bildgebungsdaten insbesondere neurologischer Patienten mit krankhaft verändertem Hirngewebe verarbeiten zu können. Der grundlegende Arbeitsablauf zur Simulation wurde zunächst für gesunde Erwachsene entworfen und validiert. Dies umfasste die Zusammenstellung medizinischer Bildverarbeitungsalgorithmen zu einer umfangreichen Verarbeitungskette, um elektrisch relevante Strukturen in den Magnetresonanztomographiebildern des Kopfes und des Oberkörpers der Probanden zu identifizieren und zu extrahieren. 
Die identifizierten Strukturen mussten in Computermodelle überführt werden und das zugrundeliegende, physikalische Problem der elektrischen Volumenleitung in biologischen Geweben mit Hilfe numerischer Simulation gelöst werden. Im Verlauf des normalen Alterns ist das Gehirn strukturellen Veränderungen unterworfen, unter denen ein Verlust des Hirnvolumens sowie die Ausbildung mikroskopischer Veränderungen seiner Nervenfaserstruktur die Bedeutendsten sind. In einem zweiten Schritt wurde der Arbeitsablauf daher erweitert, um diese Phänomene des normalen Alterns zu berücksichtigen. Die vordergründige Herausforderung in diesem Teilprojekt war die biophysikalische Modellierung der veränderten Hirnmikrostruktur, da die resultierenden Veränderungen im Leitfähigkeitsprofil des Gehirns bisher noch nicht in der Literatur quantifiziert wurden. Die Erweiterung des Simulationsablauf zeichnete sich vorrangig dadurch aus, dass mit unsicheren elektrischen Leitfähigkeitswerten gearbeitet werden konnte. Damit war es möglich den Einfluss der ungenau bestimmbaren elektrischen Leitfähigkeit der verschiedenen biologischen Strukturen des menschlichen Kopfes auf das elektrische Feld zu ermitteln. In einer Simulationsstudie, in der Bilddaten von 88 Probanden einflossen, wurde die Auswirkung der veränderten Hirnfaserstruktur auf das elektrische Feld dann systematisch untersucht. Es wurde festgestellt, dass sich diese Gewebsveränderungen hochgradig lokal und im Allgemeinen gering auswirken. Schließlich wurden in einem dritten Schritt Simulationen für Schlaganfallpatienten durchgeführt. Ihre großen, strukturzerstörenden Läsionen wurden dabei mit einem höheren Detailgrad als in bisherigen Arbeiten modelliert und physikalisch abermals mit unsicheren Leitfähigkeiten gearbeitet, was zu unsicheren elektrischen Feldabschätzungen führte. Es wurden individuell berechnete elektrische Felddaten mit der Hirnaktivierung von 18 Patienten in Verbindung gesetzt, unter Berücksichtigung der inhärenten Unsicherheit in der Bestimmung der elektrischen Felder. Das Ziel war zu ergründen, ob die Hirnstimulation einen positiven Einfluss auf die Hirnaktivität der Patienten im Kontext von Rehabilitationstherapie ausüben und so die Neuorganisierung des Gehirns nach einem Schlaganfall unterstützen kann. 
Während ein schwacher Zusammenhang hergestellt werden konnte, sind weitere Untersuchungen nötig, um diese Frage abschließend zu klären.:Kurzfassung Abstract Contents 1 Overview 2 Anatomical structures in magnetic resonance images 2 Anatomical structures in magnetic resonance images 2.1 Neuroanatomy 2.2 Magnetic resonance imaging 2.3 Segmentation of MR images 2.4 Image morphology 2.5 Summary 3 Magnetic resonance image processing pipeline 3.1 Introduction to human body modeling 3.2 Description of the processing pipeline 3.3 Intermediate and final outcomes in two subjects 3.4 Discussion, limitations & future work 3.5 Conclusion 4 Numerical simulation of transcranial electric stimulation 4.1 Electrostatic foundations 4.2 Discretization of electrostatic quantities 4.3 The numeric solution process 4.4 Spatial discretization by volume meshing 4.5 Summary 5 Simulation workflow 5.1 Overview of tES simulation pipelines 5.2 My implementation of a tES simulation workflow 5.3 Verification & application examples 5.4 Discussion & Conclusion 6 Transcranial direct current stimulation in the aging brain 6.1 Handling age-related brain changes in tES simulations 6.2 Procedure of the simulation study 6.3 Results of the uncertainty analysis 6.4 Findings, limitations and discussion 7 Transcranial direct current stimulation in stroke patients 7.1 Bridging the gap between simulated electric fields and brain activation in stroke patients 7.2 Methodology for relating simulated electric fields to functional MRI data 7.3 Evaluation of the simulation study and correlation analysis 7.4 Discussion & Conclusion 8 Outlooks for simulations of transcranial electric stimulation List of Figures List of Tables Glossary of Neuroscience Terms Glossary of Technical Terms Bibliography Transcranial electric current stimulation (tES) denotes a group of brain stimulation techniques that apply a weak electric current over two or more non-invasively, head-mounted electrodes. When employing a direct-current, this method is denoted transcranial direct current stimulation (tDCS). The general aim of all tES techniques is the modulation of brain function by an up- or downregulation of brain activity. Among these, transcranial direct current stimulation is investigated as an adjuvant tool to promote processes of the microscopic reorganization of the brain as a consequence of learning and, more specifically, rehabilitation therapy after a stroke. Current challenges of this research are a high variability in the achieved stimulation effects across subjects and an incomplete understanding of the interplay between its underlying mechanisms. A key component to understanding the stimulation mechanism is considered the electric field, which is exerted by the electrodes and distributes in the subjects' heads. A principle concept assumes that brain areas exposed to a higher electric field strength likewise experience a higher stimulation. This attributes the positioning of the electrodes a decisive role for the stimulation. However, the electric field distributes non-uniformly across subjects' brains due to the heterogeneous electrical conductivity profile of the human head. Moreover, the distribution pattern is variable between subjects due to their individual anatomy. A trivial estimation of the distribution of the electric field solely based on the position of the stimulating electrodes is, therefore, not precise enough for a well-targeted stimulation. 
Computer-based biophysical simulations of transcranial electric stimulation enable the individual approximation of the distribution pattern of the electric field in subjects based on their medical imaging data. They are, thus, increasingly employed for the planning and verification of tDCS applications and constitute an essential tool on the way to individualized stroke rehabilitation therapy. Software pipelines facilitating the underlying individualized processing for a wide range of researchers have been developed for use in healthy adults over the past years, but, to date, the simulation of patients with abnormal brain tissue and structure disrupting lesions remains a non-trivial task. Therefore, the presented project was dedicated to establishing and practically applying a tES simulation workflow. The processing of medical imaging data of neurological patients with abnormal brain tissue was a central requirement in this process. The basic simulation workflow was first designed and validated for the simulation of healthy adults. This comprised compiling medical image processing algorithms into a comprehensive workflow to identify and extract electrically relevant physiological structures of the human head and upper torso from magnetic resonance images. The identified structures had to be converted to computational models. The underlying physical problem of electric volume conduction in biological tissue was solved by means of numeric simulation. Over the course of normal aging, the brain is subjected to structural alterations, among which a loss of brain volume and the development of microscopic alterations of its fiber structure are the most relevant. In a second step, the workflow was, thus, extended to incorporate these phenomena of normal aging. The main challenge in this subproject was the biophysical modeling of the altered brain microstructure as the resulting alterations to the conductivity profile of the brain were so far not quantified in the literature. Therefore, the augmentation of the workflow most notably included the modeling of uncertain electrical properties. With this, the influence of the uncertain electrical conductivity of the biological structures of the human head on the electric field could be assessed. In a simulation study, including imaging data of 88 subjects, the influence of the altered brain fiber structure on the electric field was then systematically investigated. These tissue alterations were found to exhibit a highly localized and generally low impact. Finally, in a third step, tDCS simulations of stroke patients were conducted. Their large, structure-disrupting lesions were modeled in a more detailed manner than in previous stroke simulation studies, and they were physically, again, modeled by uncertain electrical conductivity resulting in uncertain electric field estimates. Individually simulated electric fields were related to the brain activation of 18 patients, considering the inherently uncertain electric field estimations. The goal was to clarify whether the stimulation exerts a positive influence on brain function in the context of rehabilitation therapy supporting brain reorganization following a stroke. 
While a weak correlation could be established, further investigation will be necessary to answer that research question.
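To make the numerical core of such a pipeline concrete, the following sketch solves the volume-conduction equation div(σ∇φ) = 0 on a crude two-dimensional grid with two fixed-potential electrode patches by simple Jacobi relaxation. Grid, conductivities and boundary handling are invented for illustration only and have nothing to do with the finite-element head models used in the thesis.

```python
import numpy as np

# Invented 2-D "head" slice: background conductivity 0.33 S/m with a poorly
# conducting top layer, two fixed-potential electrode patches on the top edge.
n = 80
sigma = np.full((n, n), 0.33)
sigma[:8, :] = 0.01                      # crude stand-in for skull

phi = np.zeros((n, n))                   # electric potential, outer boundary fixed at 0
anode, cathode = (0, slice(10, 20)), (0, slice(60, 70))

# Face conductivities for the heterogeneous 5-point stencil of div(sigma grad phi) = 0.
sN = 0.5 * (sigma[1:-1, 1:-1] + sigma[:-2, 1:-1])
sS = 0.5 * (sigma[1:-1, 1:-1] + sigma[2:, 1:-1])
sW = 0.5 * (sigma[1:-1, 1:-1] + sigma[1:-1, :-2])
sE = 0.5 * (sigma[1:-1, 1:-1] + sigma[1:-1, 2:])
denom = sN + sS + sW + sE

for _ in range(5000):                    # plain Jacobi relaxation
    phi[anode], phi[cathode] = 1.0, -1.0
    phi[1:-1, 1:-1] = (sN * phi[:-2, 1:-1] + sS * phi[2:, 1:-1] +
                       sW * phi[1:-1, :-2] + sE * phi[1:-1, 2:]) / denom

ey, ex = np.gradient(-phi)               # E = -grad(phi), grid spacing = 1
print("max |E| below the 'skull' layer:", np.hypot(ex, ey)[10:, :].max())
```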
- Published
- 2021
18. New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly
- Author
-
Gatter, Thomas and Universität Leipzig
- Subjects
ddc:000 ,Assembly, Transcriptome, Genome, Sequencing, Graph Theory - Abstract
Great efforts have been devoted to decipher the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, "hybrid" methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only a few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information which, however, is challenging to utilize due to the large amount of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low-level features. Benchmarks show stable improvements already with 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, which is also demonstrable when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique.
It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.:1 Preface 1.1 Assembly: A vast and fast evolving field 1.2 Structure of this Work 1.3 Available 2 Introduction 2.1 Mathematical Background 2.2 High-Throughput Sequencing 2.3 Assembly 2.4 Transcriptome Expression 3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly 3.1 Background 3.2 Strategy 3.3 Data preprocessing 3.4 Processing of the overlap graph 3.5 Post Processing of the Path Decomposition 3.6 Benchmarking 3.7 MuCHSALSA – Moving towards the future 4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly 4.1 Background 4.2 Strategy 4.3 The Ryūtō core algorithm 4.4 Improved Multi-sample transcript assembly with Ryūtō 5 Conclusion & Future Work 5.1 Discussion and Outlook 5.2 Summary and Conclusion
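The flow-based decomposition at the heart of Ryūtō can be illustrated on a toy instance: the sketch below builds a tiny splice graph, computes a minimum-cost flow with networkx, and greedily peels the flow into paths as candidate transcripts. Graph, capacities and costs are invented, and the greedy peeling is a simplification of the phasing-aware decomposition described above, not Ryūtō's actual algorithm.

```python
import networkx as nx

# Toy splice graph: nodes are exons, edge capacities mimic observed read
# coverage, weights are placeholder costs. Invented for illustration only.
G = nx.DiGraph()
G.add_node("src", demand=-30)
G.add_node("snk", demand=30)
for u, v, cap, w in [("src", "e1", 30, 0), ("e1", "e2", 20, 1),
                     ("e1", "e3", 10, 1), ("e2", "e4", 20, 1),
                     ("e3", "e4", 10, 1), ("e4", "snk", 30, 0)]:
    G.add_edge(u, v, capacity=cap, weight=w)

flow = nx.min_cost_flow(G)
flow = {u: {v: f for v, f in d.items() if f > 0} for u, d in flow.items()}

def peel_paths(flow):
    """Greedily decompose a src->snk flow into weighted paths."""
    paths = []
    while flow["src"]:
        path, node = ["src"], "src"
        while node != "snk":
            node = next(iter(flow[node]))            # follow any positive-flow edge
            path.append(node)
        amount = min(flow[path[i]][path[i + 1]] for i in range(len(path) - 1))
        for i in range(len(path) - 1):
            flow[path[i]][path[i + 1]] -= amount
            if flow[path[i]][path[i + 1]] == 0:
                del flow[path[i]][path[i + 1]]
        paths.append((path[1:-1], amount))
    return paths

for exons, abundance in peel_paths(flow):
    print("candidate transcript:", "-".join(exons), "| abundance:", abundance)
```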
- Published
- 2021
19. Geospatial Analysis and Modeling of Textual Descriptions of Pre-modern Geography
- Author
-
Seydi Gheranghiyeh, Masoumeh and Universität Leipzig
- Subjects
Digital Humanities, Classical Geography, Classical Islamic Geography, Modeling, Visualization ,ddc:000 - Abstract
Textual descriptions of pre-modern geography offer a different view of classical geography. The descriptions were produced when none of the modern geographical concepts and tools were available. In this dissertation, we study pre-modern geography primarily by identifying the existing structures of the descriptions and the different cases of geographical data. We first explain four major geographical cases in pre-modern Arabic sources: gazetteers, administrative hierarchies, routes, and toponyms associated with people. Focusing on hierarchical divisions and routes, we offer approaches for the manual annotation of administrative hierarchies and route sections as well as semi-automated toponym annotation. The latter starts with a fuzzy search of toponyms from an authority list and applies two different extrapolation models to infer true or false values, based on the context, for disambiguating the automatically annotated toponyms. Having the annotated data, we introduce mathematical models to shape and visualize regions based on the description of administrative hierarchies. Moreover, we offer models for comparing hierarchical divisions and route networks from different sources. We also suggest approaches to approximate geographical coordinates for places that do not have geographical coordinates - we call them unknown places - which is a major issue in the visualization of pre-modern places on a map. The final chapter of the dissertation introduces the new version of al-Ṯurayyā, a gazetteer and a spatial model of the classical Islamic world using georeferenced data of a pre-modern atlas with more than 2,000 toponyms and routes. It offers search, path finding, and flood network functionalities as well as visualizations of regions using one of the models that we describe for regions.
Although the gazetteer is designed using data of the classical Islamic world, the spatial model and its features can be used for similarly prepared datasets.:1 Introduction 1 2 Related Work 8 2.1 GIS 8 2.2 NLP, Georeferencing, Geoparsing, Annotation 10 2.3 Gazetteer 15 2.4 Modeling 17 3 Classical Geographical Cases 20 3.1 Gazetteer 21 3.2 Routes and Travelogues 22 3.3 Administrative Hierarchy 24 3.4 Geographical Aspects of Biographical Data 25 4 Annotation and Extraction 27 4.1 Annotation 29 4.1.1 Manual Annotation of Geographical Texts 29 4.1.1.1 Administrative Hierarchy 30 4.1.1.2 Routes and Travelogues 32 4.1.2 Semi-Automatic Toponym Annotation 34 4.1.2.1 The Annotation Process 35 4.1.2.2 Extrapolation Models 37 4.1.2.2.1 Frequency of Toponymic N-grams 37 4.1.2.2.2 Co-occurrence Frequencies 38 4.1.2.2.3 A Supervised ML Approach 40 4.1.2.3 Summary 45 4.2 Data Extraction and Structures 45 4.2.1 Administrative Hierarchy 45 4.2.2 Routes and Distances 49 5 Modeling Geographical Data 51 5.1 Mathematical Models for Administrative Hierarchies 52 5.1.1 Sample Data 53 5.1.2 Quadtree 56 5.1.3 Voronoi Diagram 58 5.1.4 Voronoi Clippings 62 5.1.4.1 Convex Hull 62 5.1.4.2 Concave Hull 63 5.1.5 Convex Hulls 65 5.1.6 Concave Hulls 67 5.1.7 Route Network 69 5.1.8 Summary of Models for Administrative Hierarchy 69 5.2 Comparison Models 71 5.2.1 Hierarchical Data 71 5.2.1.1 Test Data 73 5.2.2 Route Networks 76 5.2.2.1 Post-processing 81 5.2.2.2 Applications 82 5.3 Unknown Places 84 6 Al-Ṯurayyā 89 6.1 Introducing al-Ṯurayyā 90 6.2 Gazetteer 90 6.3 Spatial Model 91 6.3.1 Provinces and Administrative Divisions 93 6.3.2 Pathfinding and Itineraries 93 6.3.3 Flood Network 96 6.3.4 Path Alignment Tool 97 6.3.5 Data Structure 99 6.3.5.1 Places 100 6.3.5.2 Routes and Distances 100 7 Conclusions and Further Work 101
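One of the simpler region models mentioned above, the convex hull of all toponyms assigned to an administrative division, can be sketched in a few lines with scipy. The province names and coordinates below are invented for illustration and are not taken from al-Ṯurayyā.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical georeferenced toponyms (lon, lat) grouped by province;
# names and coordinates are invented for illustration.
provinces = {
    "ProvinceA": np.array([[52.5, 29.6], [53.6, 28.9], [52.0, 28.2],
                           [54.1, 29.9], [53.0, 29.1]]),
    "ProvinceB": np.array([[57.1, 30.3], [56.3, 29.5], [58.0, 29.0],
                           [57.5, 30.8], [56.8, 30.0]]),
}

# Convex-hull region model: the hull of all places assigned to a division
# serves as a first approximation of its spatial extent.
for name, pts in provinces.items():
    hull = ConvexHull(pts)
    outline = pts[hull.vertices]          # hull vertices in counter-clockwise order
    print(name, "outline:", np.round(outline, 2).tolist())
```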
- Published
- 2021
20. Gene Family Histories: Theory and Algorithms
- Author
-
Schaller, David and Universität Leipzig
- Subjects
phylogenetics, gene families, reconciliation, orthology, xenology ,ddc:000 - Abstract
Detailed gene family histories and reconciliations with species trees are a prerequisite for studying associations between genetic and phenotypic innovations. Even though the true evolutionary scenarios are usually unknown, they impose certain constraints on the mathematical structure of data obtained from simple yes/no questions in pairwise comparisons of gene sequences. Recent advances in this field have led to the development of methods for reconstructing (aspects of) the scenarios on the basis of such relation data, which can most naturally be represented by graphs on the set of considered genes. We provide here novel characterizations of best match graphs (BMGs), which capture the notion of (reciprocal) best hits based on sequence similarities. BMGs provide the basis for the detection of orthologous genes (genes that diverged after a speciation event). There are two main sources of error in pipelines for orthology inference based on BMGs. Firstly, measurement errors in the estimation of best matches from sequence similarity in general lead to violations of the characteristic properties of BMGs. The second issue concerns the reconstruction of the orthology relation from a BMG. We show how to correct estimated BMGs to mathematically valid ones and how much information about orthologs is contained in BMGs. We then discuss implicit methods for horizontal gene transfer (HGT) inference that focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of an undirected graph, the later-divergence-time (LDT) graph. We explore the mathematical structure of LDT graphs and show how much information about all HGT events is contained in such LDT graphs.
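The notion of best matches can be made concrete with a small example: from an invented matrix of evolutionary distances between genes of three species, the sketch below derives the directed best match graph (an arc x → y whenever y is among x's closest genes in y's species) and its reciprocal part. Gene names, species assignment and distances are hypothetical.

```python
import numpy as np

# Hypothetical evolutionary distances between five genes from three species
# (values invented purely to illustrate the best-match construction).
genes = ["a1", "a2", "b1", "b2", "c1"]
species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
D = np.array([[0.0, 0.3, 0.5, 0.8, 0.9],
              [0.3, 0.0, 0.7, 0.4, 0.9],
              [0.5, 0.7, 0.0, 0.3, 0.6],
              [0.8, 0.4, 0.3, 0.0, 0.7],
              [0.9, 0.9, 0.6, 0.7, 0.0]])

arcs = set()
for i, x in enumerate(genes):
    for s in set(species.values()) - {species[x]}:
        cands = [j for j, y in enumerate(genes) if species[y] == s]
        best = min(D[i, j] for j in cands)
        arcs.update((x, genes[j]) for j in cands if D[i, j] == best)

reciprocal = {tuple(sorted(p)) for p in arcs if p[::-1] in arcs}
print("best-match arcs:        ", sorted(arcs))
print("reciprocal best matches:", sorted(reciprocal))
```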
- Published
- 2021
21. A Revision of Procedural Knowledge in the conML Framework
- Author
-
Große, Florian Peter, Schmid, Thomas, Bogdan, Martin, and Universität Leipzig
- Subjects
Maschinelles Lernen, Konstruktivistisches Maschinelles Lernen, Hybride KI, conML, ML, KI ,machine learning, constructivist machine learning, hybrid AI, conML, ML, AI ,ddc:000 - Abstract
Machine learning methods have been used very successfully for quite some time to recognize patterns, model correlations and generate hypotheses. However, the possibilities for weighing and evaluating the resulting models and hypotheses, and the search for alternatives and contradictions are still predominantly reserved for humans. For this purpose, the novel concept of constructivist machine learning (conML) formalizes limitations of model validity and employs constructivist learning theory to enable doubting of new and existing models with the possibility of integrating, discarding, combining, and abstracting knowledge. The present work identifies issues that impede the systems capability to abstract knowledge from generated models for tasks that lie in the domain of procedural knowledge, and proposes and implements identified solutions. To this end, the conML framework has been reimplemented in the Julia programming language and subsequently been extended. Using a synthetic dataset of impedance spectra of modeled epithelia that has previously been analyzed with an existing implementation of conML, existing and new implementations are tested for consistency and proposed algorithmic changes are evaluated with respect to changes in model generation and abstraction ability when exploring unknown data. Recommendations for specific settings and suggestions for further research are derived from the results. In terms of performance, flexibility and extensibility, the new implementation of conML in Julia provides a good starting point for further research and application of the system.:Contents Abstract . . . . . III Zusammenfassung . . . . . IV Danksagung . . . . . V Selbstständigkeitserklärung . . . . . V 1. Introduction 1.1. Research Questions . . . . . 2 2. Related Work 2.1. Hybrid AI Systems . . . . . 5 2.2. Constructivist Machine Learning (conML) . . . . . 6 2.3. Implemented Methods . . . . . 9 2.3.1. Unsupervised Machine Learning . . . . . 9 2.3.2. Supervised Machine Learning . . . . . 11 2.3.3. Supervised Feature Selection . . . . . 13 2.3.4. Unsupervised Feature Selection . . . . . 17 3. Methods and Implementation 3.1. Notable Algorithmic Changes . . . . . 19 3.1.1. Rescaling of Target Values . . . . . 19 3.1.2. ExtendedWinner Selection . . . . . 21 3.2. Package Structure . . . . . 23 3.3. Interfaces and Implementation of Specific Methods . . . . . 29 3.4. Datasets . . . . . 41 4. Results 4.1. Validation Against the conML Prototype . . . . . 43 4.2. Change in Abstraction Capability . . . . . 49 4.2.1. Influence of Target Scaling . . . . . 49 4.2.2. Influence of the Parameter kappa_p . . . . . 55 4.2.3. Influence of the Winner Selection Procedure . . . . . 61 5. Discussion 5.1. Reproduction Results . . . . . 67 5.2. Rescaling of Constructed Targets . . . . . 69 5.3. kappa_p and the Selection of Winner Models . . . . . 71 6. Conclusions 6.1. Contributions of this Work . . . . . 77 6.2. Future Work . . . . . 78 A. Julia Language Reference . . . . . 81 B. Additional Code Listings . . . . . 91 C. Available Parameters . . . . . 99 C.1. Block Processing . . . . . 105 D. Configurations Reference . . . . . 107 D.1. Unsupervised Methods . . . . . 107 D.2. Supervised Methods . . . . . 108 D.3. Feature Selection . . . . . 109 D.4. Winner Selection . . . . . 110 D.5. General Settings . . . . . 110 E. Supplemental Figures . . . . . 113 E.1. Replacing MAPE with RMSE for Z-Transform Target Scaling . . . . . 113 E.2. Combining Target Rescaling, Winner Selection and High kappa_p . . . . . 
119 Bibliography . . . . . 123 List of Figures . . . . . 129 List of Listings . . . . . 133 List of Tables . . . . . 135 Maschinelle Lernverfahren werden seit geraumer Zeit sehr erfolgreich zum Erkennen von Mustern, Abbilden von Zusammenhängen und Generieren von Hypothesen eingesetzt. Die Möglichkeiten zum Abwägen und Bewerten der entstandenen Modelle und Hypothesen, und die Suche nach Alternativen und Widersprüchen sind jedoch noch überwiegend dem Menschen vorbehalten. Das neuartige Konzept des konstruktivistischen maschinellen Lernens (conML) formalisiert dazu die Grenzen der Gültigkeit von Modellen und ermöglicht mittels konstruktivistischer Lerntheorie ein Zweifeln über neue und bestehende Modelle mit der Möglichkeit zum Integrieren, Verwerfen, Kombinieren und Abstrahieren von Wissen. Die vorliegende Arbeit identifiziert Probleme, die die Abstraktionsfähigkeit des Systems bei Aufgabenstellungen in der Prozeduralen Wissensdomäne einschränken, bietet Lösungsvorschläge und beschreibt deren Umsetzung. Das algorithmische Framework conML ist dazu in der Programmiersprache Julia reimplementiert und anschließend erweitert worden. Anhand eines synthetischen Datensatzes von Impedanzspektren modellierter Epithelien, der bereits mit einem Prototypen des conML Systems analysiert worden ist, werden bestehende und neue Implementierung auf Konsistenz geprüft und die vorgeschlagenen algorithmischen Änderungen im Hinblick auf Veränderungen beim Erzeugen von Modellen und der Abstraktionsfähigkeit bei der Exploration unbekannter Daten untersucht. Aus den Ergebnissen werden Empfehlungen zu konkreten Einstellungen sowie Vorschläge für weitere Untersuchungen abgeleitet. Die neue Implementierung von conML in Julia bietet im Hinblick auf Performanz, Flexibilität und Erweiterbarkeit einen guten Ausgangspunkt für weitere Forschung und Anwendung des Systems.:Contents Abstract . . . . . III Zusammenfassung . . . . . IV Danksagung . . . . . V Selbstständigkeitserklärung . . . . . V 1. Introduction 1.1. Research Questions . . . . . 2 2. Related Work 2.1. Hybrid AI Systems . . . . . 5 2.2. Constructivist Machine Learning (conML) . . . . . 6 2.3. Implemented Methods . . . . . 9 2.3.1. Unsupervised Machine Learning . . . . . 9 2.3.2. Supervised Machine Learning . . . . . 11 2.3.3. Supervised Feature Selection . . . . . 13 2.3.4. Unsupervised Feature Selection . . . . . 17 3. Methods and Implementation 3.1. Notable Algorithmic Changes . . . . . 19 3.1.1. Rescaling of Target Values . . . . . 19 3.1.2. ExtendedWinner Selection . . . . . 21 3.2. Package Structure . . . . . 23 3.3. Interfaces and Implementation of Specific Methods . . . . . 29 3.4. Datasets . . . . . 41 4. Results 4.1. Validation Against the conML Prototype . . . . . 43 4.2. Change in Abstraction Capability . . . . . 49 4.2.1. Influence of Target Scaling . . . . . 49 4.2.2. Influence of the Parameter kappa_p . . . . . 55 4.2.3. Influence of the Winner Selection Procedure . . . . . 61 5. Discussion 5.1. Reproduction Results . . . . . 67 5.2. Rescaling of Constructed Targets . . . . . 69 5.3. kappa_p and the Selection of Winner Models . . . . . 71 6. Conclusions 6.1. Contributions of this Work . . . . . 77 6.2. Future Work . . . . . 78 A. Julia Language Reference . . . . . 81 B. Additional Code Listings . . . . . 91 C. Available Parameters . . . . . 99 C.1. Block Processing . . . . . 105 D. Configurations Reference . . . . . 107 D.1. Unsupervised Methods . . . . . 107 D.2. Supervised Methods . . . . . 108 D.3. Feature Selection . . . . . 109 D.4. 
Winner Selection . . . . . 110 D.5. General Settings . . . . . 110 E. Supplemental Figures . . . . . 113 E.1. Replacing MAPE with RMSE for Z-Transform Target Scaling . . . . . 113 E.2. Combining Target Rescaling, Winner Selection and High kappa_p . . . . . 119 Bibliography . . . . . 123 List of Figures . . . . . 129 List of Listings . . . . . 133 List of Tables . . . . . 135
- Published
- 2021
22. On the Study of Fitness Landscapes and the Max-Cut Problem
- Author
-
Rodriguez Fernandez, Angel Eduardo and Universität Leipzig
- Subjects
ddc:000 ,Max-Cut, Max-k-Cut, Approximation Algorithms, Randomized Rounding, Clustering - Abstract
The goal of this thesis is to study the complexity of NP-hard problems, using the Max-Cut and Max-k-Cut problems as examples, and to study fitness landscapes. The Max-Cut and Max-k-Cut problems are well-studied NP-hard problems, especially since the approximation algorithm of Goemans and Williamson (1995), which introduced the use of semidefinite programming (SDP) to solve relaxed problems. In order to prove the existence of a performance guarantee, the rounding step from the SDP solution to a Max-Cut solution is simple and randomized. For the Max-k-Cut problem, there exist several approximation algorithms, but many of them have been proved to be equivalent. As in Max-Cut, these approximation algorithms use a simple randomized rounding to obtain a performance guarantee. Ignoring for now the performance guarantee, one could ask whether there is a rounding process that takes into account the structure of the relaxed solution, since it is the result of an optimization problem. In this thesis we answer this question positively by using clustering as a rounding method. In order to compare the performance of both algorithms, a series of experiments was performed using the so-called G-set benchmark for the Max-Cut problem and the random graph benchmark of Goemans and Williamson (1995) for the Max-k-Cut problem. With this new rounding, larger cut values are found both for the Max-Cut and the Max-k-Cut problems, always above the value of the performance guarantee of the approximation algorithm. This suggests that taking into account the structure of the problem when designing algorithms can lead to better results, possibly at the cost of a worse performance guarantee. An example for the vertex k-center problem can be seen in Garcia-Diaz et al. (2017), where a 3-approximation algorithm performs better than a 2-approximation algorithm despite having a worse performance guarantee. Landscapes over discrete configuration spaces are an important model in evolutionary and structural biology, as well as many other areas of science, from the physics of disordered systems to operations research. A landscape is a function from a very large discrete set V, which carries an additional metric or at least topological structure, into the real numbers R. We will consider landscapes defined on the vertex set of undirected graphs. Thus let G=G(V,E) be an undirected graph and f an arbitrary real-valued function on V. We will refer to the triple (V,E,f) as a landscape over G. We say two configurations x,y in V are neutral if f(x)=f(y). We colloquially refer to a landscape as 'neutral' if a substantial fraction of adjacent pairs of configurations are neutral. A flat landscape is one where f is constant. The opposite of flatness is ruggedness, which is defined via the number of local optima or by means of pair correlation functions. These two key features of a landscape, ruggedness and neutrality, appear to be two sides of the same coin. Ruggedness can be measured either by correlation properties, which are sensitive to monotonic transformations of the landscape, or by combinatorial properties such as the lengths of downhill paths and the number of local optima, which are invariant under monotonic transformations. The connection between the two views has remained largely unexplored and poorly understood.
For this thesis, a survey of fitness landscapes is presented, together with first steps towards finding this connection and towards relating the covariance matrix of a random landscape model to its ruggedness.
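The two rounding strategies contrasted above can be sketched on a toy Max-Cut instance: solve the Goemans-Williamson SDP relaxation, then round the relaxed vectors once with a random hyperplane and once with k-means clustering as a simplified stand-in for the clustering-based rounding studied in the thesis. The graph, the use of cvxpy with its default SDP solver, and scikit-learn's KMeans are assumptions made only for this illustration.

```python
import cvxpy as cp
import numpy as np
from sklearn.cluster import KMeans

# Toy Max-Cut instance (unit edge weights on 6 nodes).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
n = 6

# Goemans-Williamson SDP relaxation: X is PSD with unit diagonal.
X = cp.Variable((n, n), PSD=True)
objective = cp.Maximize(sum((1 - X[i, j]) / 2 for i, j in edges))
cp.Problem(objective, [cp.diag(X) == 1]).solve()

# Recover vectors v_i with <v_i, v_j> = X_ij via an eigendecomposition.
w, U = np.linalg.eigh(X.value)
V = U * np.sqrt(np.clip(w, 0, None))            # rows are the relaxed vectors

def cut_value(labels):
    return sum(labels[i] != labels[j] for i, j in edges)

rng = np.random.default_rng(0)
hyperplane = (V @ rng.standard_normal(n)) >= 0                  # classic GW rounding
clustered = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(V)

print("random-hyperplane cut:", cut_value(hyperplane))
print("clustering-based cut: ", cut_value(clustered))
```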
- Published
- 2021
23. Segmentation of Heterogeneous Multivariate Genome Annotation Data
- Author
-
Saker, Halima, Universität Leipzig, and Lebanese University
- Subjects
ddc:000 ,Bioinformatics, segmentation aggregation, consensus segmentation, Time series analysis, dynamic programming - Abstract
Due to the potential impact of next-generation sequencing (NGS), we have seen a rapid increase in genomic information and annotation information that can be naturally mapped to genomic locations. In cancer research, for example, there are significant efforts to chart DNA methylation at single-nucleotide resolution. The NIH Roadmap Epigenomics Project, on the other hand, has set out to chart a large number of different histone modifications. However, throughout the last few years, a very diverse set of aspects has become the aim of large-scale experiments with a genome-wide readout. Therefore, the identification of functional units of the genomic DNA is considered a significant and essential challenge. Subsequently, we have been motivated to implement multi-dimensional segmentation approaches that accommodate gene variety and genome heterogeneity. The segmentation of multivariate genomic, epigenomic, and transcriptomic data from multiple time points, tissues, and cell types to compare changes in genomic organization and identify common elements forms the headline of our research. Next-generation sequencing offers rich material used in bioinformatics research to explore molecular functions, disease causes, and related questions. Rapid advances in technology have also led to a proliferation of experiment types. Although these experiments share next-generation sequencing as the readout, they produce signals with entirely different inherent resolutions, ranging from precise transcript structures at single-nucleotide resolution, over pull-down and enrichment-based protocols with resolutions on the order of 100 nt, to chromosome conformation data that are only accurate at kilobase resolution. Therefore, the main goal of the dissertation project is to design, implement, and test novel segmentation algorithms that work on one- and multi-dimensional data and can accommodate data of different types and resolutions. The target data in this project are multivariate genetic, epigenetic, transcriptomic, and proteomic data; the reason is that these datasets can change under the effect of several conditions such as chemical, genetic, and epigenetic modifications. A promising approach towards this end is to identify intervals of the genomic DNA that behave coherently in multiple conditions and tissues, which can be defined as intervals on which all measured quantities are constant within each experiment. A naive approach would take each dataset in isolation and estimate intervals in which the signal at hand is constant. Another approach takes all datasets at once as input without resorting to one-dimensional segmentation. Once implemented, the algorithms should be applied to heterogeneous genomic, transcriptomic, proteomic, and epigenomic data; the aim here is to draw and improve the map of functionally coherent segments of a genome. Current approaches either focus on individual datasets, as in the case of tiling array transcriptomics data, or on the analysis of comparable experiments such as ChIP-seq data for various histone modifications. The simplest sub-problem in segmentation is to decide whether two adjacent intervals should form two distinct segments or whether they should be combined into a single one. We have to find out how this should be done in multi-dimensional segmentation; in one dimension, this is relatively well understood. This leads to a segmentation of the genome with respect to the particular dataset. The intersection of segmentations for different datasets could then identify the DNA elements.
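For the one-dimensional case mentioned above, the merge-or-split decision can be cast as penalised least-squares partitioning and solved exactly by dynamic programming. The following sketch is a generic textbook formulation with invented data; it is not the consensus or multi-dimensional segmentation method developed in the thesis.

```python
import numpy as np

def segment_1d(signal, penalty):
    """Exact 1-D segmentation into constant pieces: minimise within-segment
    squared error plus `penalty` per segment, by dynamic programming."""
    n = len(signal)
    c1 = np.concatenate(([0.0], np.cumsum(signal)))
    c2 = np.concatenate(([0.0], np.cumsum(np.square(signal))))

    def sse(i, j):                       # error of one constant fit to signal[i:j]
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    last_cut = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j], last_cut[j] = cost, i
    bounds, j = [], n                    # backtrack segment boundaries
    while j > 0:
        bounds.append((last_cut[j], j))
        j = last_cut[j]
    return bounds[::-1]

# Invented piecewise-constant "coverage track" with noise.
rng = np.random.default_rng(1)
track = np.concatenate([rng.normal(m, 0.3, 40) for m in (0.0, 2.0, 1.0)])
print(segment_1d(track, penalty=5.0))    # expected: cuts near positions 40 and 80
```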
- Published
- 2021
24. Unsupervised Pretraining of Neural Networks with Multiple Targets using Siamese Architectures
- Author
-
Bryan, Maximilian and Universität Leipzig
- Subjects
ddc:000 ,KI, Modelle, Training - Abstract
A model's response to a given input pattern depends on the patterns seen in the training data. The larger the amount of training data, the more likely it is that edge cases are covered during training. However, the more complex the input patterns are, the larger the model has to be. For very simple use cases, a relatively small model can achieve very high test accuracy in a matter of minutes. On the other hand, a large model has to be trained for multiple days. The actual time to develop a model of that size can be considered to be even greater, since often many different architecture types and hyper-parameter configurations have to be tried. An extreme case of a large model is the recently released GPT-3 model. This model consists of 175 billion parameters and was trained using 45 terabytes of text data. The model was trained to generate text and is able to write news articles and source code based only on a rough description. However, a model like this can only be created by researchers with access to special hardware or immense amounts of data. Thus, it is desirable to find less resource-intensive training approaches to enable other researchers to create well-performing models. This thesis investigates the use of pre-trained models. If a model has been trained on one dataset and is then trained on other, similar data, it learns to adjust to similar patterns faster than a model that has not yet seen any of the task's patterns. Thus, the lessons learned from one training are transferred to another task. During pre-training, the model is trained to solve a specific task like predicting the next word in a sequence or first encoding an input image before decoding it. Such models contain an encoder and a decoder part. When transferring such a model to another task, parts of the model's layers will be removed. As a result, having to discard fewer weights results in faster training, since less time has to be spent on training parts of a model that are only needed to solve an auxiliary task. Throughout this thesis, the concept of siamese architectures will be discussed, since with that architecture no parameters have to be discarded when transferring a model trained with that approach onto another task. Thus, the siamese pre-training approach positively impacts the need for resources like time and energy and drives the development of new models in the direction of Green AI. The models trained with this approach will be evaluated by comparing them to models trained with other pre-training approaches as well as to large existing models. It will be shown that the models trained for the tasks in this thesis perform as well as externally pre-trained models, given the right choice of data and training targets: It will be shown that the number and type of training targets during pre-training impact a model's performance on transfer learning tasks. The use cases presented in this thesis cover different data from different domains to show that the siamese training approach is widely applicable. Consequently, researchers are motivated to create their own pre-trained models for data domains for which there are no existing pre-trained models. Die Vorhersage eines Modells hängt davon ab, welche Muster in den während des Trainings benutzten Daten vorhanden sind. Je größer die Menge an Trainingsdaten ist, desto wahrscheinlicher ist es, dass Grenzfälle in den Daten vorkommen. Je größer jedoch die Anzahl der zu lernenden Muster ist, desto größer muss das Modell sein.
Für einfache Anwendungsfälle ist es möglich ein kleines Modell in wenigen Minuten zu trainieren um bereits gute Ergebnisse auf Testdaten zu erhalten. Für komplexe Anwendungsfälle kann ein dementsprechend großes Modell jedoch bis zu mehrere Tage benötigen um ausreichend gut zu sein. Ein Extremfall für ein großes Modell ist das kürzlich veröffentlichte Modell mit dem Namen GPT-3, welches aus 175 Milliarden Parametern besteht und mit Trainingsdaten in der Größenordnung von 45 Terabyte trainiert wurde. Das Modell wurde trainiert Text zu generieren und ist in der Lage Nachrichtenartikel zu generieren, basierend auf einer groben Ausgangsbeschreibung. Solch ein Modell können nur solche Forscher entwickeln, die Zugang zu entsprechender Hardware und Datenmengen haben. Es demnach von Interesse Trainingsvorgehen dahingehend zu verbessern, dass auch mit wenig vorhandenen Ressourcen Modelle für komplexe Anwendungsfälle trainiert werden können. Diese Arbeit beschäfigt sich mit dem Vortrainieren von neuronalen Netzen. Wenn ein neuronales Netz auf einem Datensatz trainiert wurde und dann auf einem zweiten Datensatz weiter trainiert wird, lernt es die Merkmale des zweiten Datensatzes schneller, da es nicht von Grund auf Muster lernen muss sondern auf bereits gelerntes zurückgreifen kann. Man spricht dann davon, dass das Wissen transferiert wird. Während des Vortrainierens bekommt ein Modell häufig eine Aufgabe wie zum Beispiel, im Fall von Bilddaten, die Trainingsdaten erst zu komprimieren und dann wieder herzustellen. Bei Textdaten könnte ein Modell vortrainiert werden, indem es einen Satz als Eingabe erhält und dann den nächsten Satz aus dem Quelldokument vorhersagen muss. Solche Modelle bestehen dementsprechend aus einem Encoder und einem Decoder. Der Nachteil bei diesem Vorgehen ist, dass der Decoder lediglich für das Vortrainieren benötigt wird und für den späteren Anwendungsfall nur der Encoder benötigt wird. Zentraler Bestandteil in dieser Arbeit ist deswegen das Untersuchen der Vorteile und Nachteile der siamesische Modellarchitektur. Diese Architektur besteht lediglich aus einem Encoder, was dazu führt, dass das Vortrainieren kostengünstiger ist, da weniger Gewichte trainiert werden müssen. Der wesentliche wissenschaftliche Beitrag liegt darin, dass die siamische Architektur ausführlich verglichen wird mit vergleichbaren Ansätzen. Dabei werden bestimmte Nachteile gefunden, wie zum Beispiel dass die Auswahl einer Ähnlichkeitsfunktion oder das Zusammenstellen der Trainingsdaten große Auswirkung auf das Modelltraining haben. Es wird erarbeitet, welche Ähnlichkeitsfunktion in welchen Kontexten empfohlen wird sowie wie andere Nachteile der siamischen Architektur durch die Anpassung der Trainingsziele ausgeglichen werden können. Die entsprechenden Experimente werden dabei auf Daten aus unterschiedlichen Domänen ausgeführt um zu zeigen, dass der entsprechende Ansatz universell anwendbar ist. Die Ergebnisse aus konkreten Anwendungsfällen zeigen außerdem, dass die innerhalb dieser Arbeit entwickelten Modelle ähnlich gut abschneiden wie extern verfügbare Modelle, welche mit großem Ressourcenaufwand trainiert worden sind. Dies zeigt, dass mit Bedacht erarbeitete Architekturen die benötigten Ressourcen verringern können.
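A minimal illustration of the siamese idea: the sketch below pre-trains a single shared encoder on two augmented views of the same (randomly generated) samples with a simple margin-based contrastive loss, after which the whole encoder could be transferred without discarding any layers. Architecture, augmentation, loss and data are illustrative assumptions, not the training targets evaluated in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One shared encoder, two augmented "views" per sample, margin-based
# contrastive loss; everything here is an illustrative stand-in.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def augment(x):
    return x + 0.1 * torch.randn_like(x)          # toy augmentation: add noise

x = torch.randn(256, 32)                          # hypothetical unlabeled data
for step in range(200):
    z1 = F.normalize(encoder(augment(x)), dim=1)  # both views pass through the
    z2 = F.normalize(encoder(augment(x)), dim=1)  # same weights (siamese setup)
    pos = (z1 - z2).pow(2).sum(1)                 # views of the same sample
    neg = (z1 - z2[torch.randperm(len(x))]).pow(2).sum(1)   # mismatched pairs
    loss = (pos + F.relu(1.0 - neg)).mean()       # pull positives, push negatives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After pre-training, the encoder is transferred as a whole; no decoder
# layers have to be discarded.
print("pre-trained feature shape:", tuple(encoder(x).shape))
```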
- Published
- 2021
25. Unraveling the genetic secrets of ancient Baikal amphipods
- Author
-
Rivarola-Duarte, Lorena and Universität Leipzig
- Subjects
ddc:000 ,genomics, transcriptomics, amphipods, Baikal - Abstract
Lake Baikal is the oldest, the largest by volume, and the deepest freshwater lake on Earth. It is characterized by an outstanding diversity of endemic faunas with more than 350 amphipod species and subspecies (Amphipoda, Crustacea, Arthropoda). They are the dominant benthic organisms in the lake, contributing substantially to the overall biomass. Eulimnogammarus verrucosus, E. cyaneus, and E. vittatus, in particular, serve as emerging models in ecotoxicological studies. It was therefore necessary to investigate whether these endemic littoral amphipod species form genetically separate populations across Baikal, in order to scrutinize whether results obtained with samples from one single location (Bolshie Koty, where the biological station is located), for example about stress responses, could be extrapolated to the complete lake or not. The genetic diversity within those three endemic littoral amphipod species was determined based on fragments of Cytochrome C Oxidase I (COI) and 18S rDNA (only for E. verrucosus). Gammarus lacustris, a Holarctic species living in water bodies near Baikal, was examined for comparison. The intra-specific genetic diversities within E. verrucosus and E. vittatus (13% and 10%, respectively) were similar to the inter-species differences, indicating the occurrence of cryptic, morphologically highly similar species. This was confirmed with 18S rDNA for E. verrucosus. The haplotypes of E. cyaneus and G. lacustris specimens were, with intra-specific genetic distances of 3% and 2%, respectively, more homogeneous, indicating no disruption, or only a recent disruption, of gene flow of E. cyaneus across Baikal, and a recent colonization of water bodies around Baikal by G. lacustris. The data provide the first clear evidence for the formation of cryptic (sub)species within endemic littoral amphipod species of Lake Baikal and mark the inflows/outflow of large rivers as dispersal barriers. Lake Baikal has provided a stable environment for millions of years, in stark contrast to the small, transient water bodies in its immediate vicinity. A highly diverse endemic amphipod fauna is found in one but not the other habitat. To gain more insights into, and explain, the immiscibility barrier between the faunas of Lake Baikal and non-Baikal environments, the differences in their stress response pathways were studied. To this end, exposure experiments with increasing temperature and a heavy metal (cadmium) as proteotoxic stressors were conducted in Russia. High-quality de novo transcriptome assemblies covering multiple conditions were obtained for three amphipod species: the Baikal endemics E. verrucosus and E. cyaneus, and the Holarctic G. lacustris as a potential invader. After comparing the transcriptomic stress responses, it was found that both Baikal species possess intact stress response systems and respond to elevated temperature with relatively similar changes in their expression profiles. G. lacustris reacts less strongly to the same stressors, possibly because its transcriptome is already perturbed by the acclimation conditions (matching the Lake Baikal littoral). Comprehensive genomic resources are of utmost importance for ecotoxicological and ecophysiological studies in an evolutionary context, especially considering the exceptional value of Baikal as a UNESCO World Heritage Site.
In that context, the results presented here on the genome of Eulimnogammarus verrucosus have been the first massive step towards establishing genomic sequence resources for a Baikalian amphipod (beyond mitochondrial genomes and gene expression data in the form of de novo transcriptome assemblies). Based on data from a survey of its genome (a single lane of paired-end Illumina HiSeq 2000 reads, 3X) as well as a full dataset (two complete flow cells, 46X), the genome size was estimated as nearly 10 Gb based on the k-mer spectra and the coverage of highly conserved miRNAs, hox genes, and other Sanger-sequenced genes. At least two-thirds of the genome are non-unique DNA, and no less than half of the genomic DNA is composed of just five families of repetitive elements, including low-complexity sequences. Some of the repeat families found in high abundance in E. verrucosus seem to be species-specific or Baikal-specific. Attempts to use off-the-shelf assembly tools on the available low-coverage data, both before and after the removal of highly repetitive components, as well as on the full dataset, resulted in extremely fragmented assemblies. Nevertheless, the analysis of coverage in Hox genes and their homeobox showed no clear evidence for paralogs, indicating that a genome duplication did not contribute to the large genome size. Several mate-pair libraries with larger insert sizes than the 2 kb used here, and long-read sequencing technology combined with semi-automated methods for genome assembly, seem to be necessary to obtain a reliable assembly for this species.
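The k-mer based size estimate mentioned above rests on a simple relation: genome size ≈ total number of (error-free) k-mers in the reads divided by the coverage at the main peak of the k-mer spectrum. The sketch below applies this to an entirely invented spectrum; real spectra come from k-mer counters such as Jellyfish, and the numbers bear no relation to the E. verrucosus data.

```python
import numpy as np

# Invented k-mer spectrum: an error peak at low multiplicity and a main
# coverage peak around 46x.
multiplicity = np.arange(1, 101)
count = np.zeros(100)
count[:5] = [5e7, 1e7, 4e6, 2e6, 1e6]                       # sequencing-error k-mers
count[40:60] = 2e8 * np.exp(-0.5 * ((multiplicity[40:60] - 46) / 4) ** 2)

total_kmers = np.sum(multiplicity * count)                  # k-mers in the reads
peak = multiplicity[5:][np.argmax(count[5:])]               # ignore the error peak
genome_size = total_kmers / peak                            # size ~ total / coverage
# (a more careful estimate would exclude the error k-mers from the total)
print(f"peak coverage: {peak}x, estimated genome size: {genome_size / 1e9:.2f} Gb")
```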
- Published
- 2021
26. Design und Implementierung eines Software-Ökosystems für textbasierte Inhaltsanalysen in den Sozialwissenschaften mit Schwerpunkt auf der Detektion schwacher Signale
- Author
-
Kahmann, Christian and Universität Leipzig
- Subjects
ddc:000 ,Text Mining, Text Mining Infrastrukturen, Digital Humanities, Semantic Change - Abstract
The use of automated quantitative methods in the social sciences is steadily gaining importance. On the one hand, this is due to the rapidly growing amount and availability of digitally available data. On the other hand, innovative automated approaches make it possible to produce results that would not be attainable through qualitative work alone. However, implementing innovative algorithms for the application of quantitative methods requires a great deal of knowledge of programming as well as of the inner workings of the methods to be applied. Since this expert knowledge is only rarely available in purely social-scientific projects, other ways of applying automated quantitative methods in the social sciences have to be used. Only Computational Social Science and the Digital Humanities constitute research areas of the social sciences that, as pioneers, already have experience in dealing with automated quantitative methods. One possible solution for the broad use of automated methods across the social sciences is the creation and application of text mining infrastructures that are specifically geared towards use in the social sciences. These allow social scientists to apply current methods and research approaches from text mining and machine learning to their own research questions and data with a comparatively low entry barrier. For such infrastructures to actually provide substantial added value for social scientists, however, several requirements must be met. These can be divided into general requirements placed on software, such as scalability and performance, and requirements specific to application in the social sciences. Among these specific requirements is the ability to deal with different kinds of underlying data. This work focuses on textual data, which themselves differ greatly in their characteristics and thus in the processing they require. Furthermore, three key requirements are identified that are essential for use in the social sciences. The first key requirement describes the general orientation of a text mining infrastructure as a generic platform which, through adaptability, extensibility, and the possibility of exporting results, can be assimilated to the numerous and partly very diverse research questions of the social sciences. The second key requirement emphasizes the necessity of being able to unite qualitative and quantitative research designs through the implementation of dedicated interfaces. In this way, both research approaches can benefit from each other. Finally, the importance of weak signals as a basis for research in the social sciences is highlighted. For all three of these key requirements, as well as for the other derived requirements for a text mining infrastructure for use in the social sciences, possible implementations and solutions are presented. This is done, on the one hand, by describing the design and development of exactly such a text mining infrastructure, using the interactive Leipzig Corpus Miner as an example. The necessary trade-offs regarding different implementation strategies and software design decisions required to meet the stated requirements are explained. On the other hand, a measure for quantifying diachronic context change, in the form of context volatility, is introduced. Over the course of this work, the measure is used to detect and analyze weak signals in textual data. In the last part of the work, the implemented realizations of the key requirements are demonstrated using several completed projects as examples. The most important contributions of this work are thus, first, a compilation of specific requirements for text mining infrastructures for use in the social sciences. Second, building on this, a possible design of a resulting research environment is explained in detail. The third contribution of this work is the further development of context volatility as a method for detecting weak signals in diachronic data.
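How a context-change measure of this kind can be computed is illustrated below in a strongly simplified form: the target word's co-occurrence profile is built per time slice and the average pairwise distance between the profiles is reported. Corpus, window and distance function are toy assumptions; this is not the exact context volatility measure implemented in the interactive Leipzig Corpus Miner.

```python
import numpy as np
from collections import Counter
from itertools import combinations

# Toy corpus split into yearly slices; the target word's co-occurrence
# profile is compared across slices. Entirely invented example data.
slices = {
    2015: ["the bank approved the loan", "the bank raised interest rates"],
    2016: ["the bank approved the loan", "people sat on the river bank"],
    2017: ["the river bank flooded", "they walked along the river bank"],
}

def context_counts(sentences, target):
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine_distance(a, b):
    vocab = sorted(set(a) | set(b))
    va = np.array([a.get(t, 0) for t in vocab], float)
    vb = np.array([b.get(t, 0) for t in vocab], float)
    return 1 - va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

profiles = {year: context_counts(texts, "bank") for year, texts in slices.items()}
volatility = np.mean([cosine_distance(profiles[a], profiles[b])
                      for a, b in combinations(sorted(profiles), 2)])
print(f"context change of 'bank' across slices: {volatility:.2f}")
```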
- Published
- 2021
27. Measuring the Energy Consumption of Software written in C on x86-64 Processors
- Author
-
Strempel, Tom, Eisenecker, Ulrich, Siegmund, Norbert, and Universität Leipzig
- Subjects
ddc:000 ,energy optimization c software - Abstract
In 2016 German data centers consumed 12.4 terawatt-hours of electrical energy, which accounts for about 2% of Germany’s total energy consumption in that year. In 2020 this rose to 16 terawatt-hours, or 2.9% of Germany’s total energy consumption in that year. The ever-increasing energy consumption of computers consequently leads to efforts to reduce it in order to save energy and money and to protect the environment. This thesis aims to answer fundamental questions about the energy consumption of software, e.g. how and how precisely a measurement can be taken, or whether CPU load and energy consumption are correlated. An overview of measurement methods and the related software tooling was created. The most promising approach, using the software 'Scaphandre', was chosen as the main basis and further developed. Different sorting algorithms were benchmarked to study their behavior regarding energy consumption. The resulting dataset was also used to answer the fundamental questions stated at the beginning. A replication and reproduction package was provided to enable the reproducibility of the results. Im Jahr 2016 verbrauchten deutsche Rechenzentren 12,4 Terawattstunden elektrische Energie, was etwa 2 % des gesamten Energieverbrauchs in Deutschland in diesem Jahr ausmacht. Im Jahr 2020 stieg dieser Wert auf 16 Terawattstunden bzw. 2,9 % des Gesamtenergieverbrauchs in Deutschland. Der stetig steigende Energieverbrauch von Computern führt folglich zu Überlegungen, diesen zu reduzieren, um Energie und Geld zu sparen und die Umwelt zu schützen. Ziel dieser Arbeit ist es, grundlegende Fragen zum Energieverbrauch von Software zu beantworten, z. B. wie und mit welcher Genauigkeit gemessen werden kann oder ob CPU-Last und Energieverbrauch korrelieren. Es wurde eine Übersicht über Messmethoden und die dazugehörigen Softwaretools erstellt. Der vielversprechendste Ansatz mit der Software 'Scaphandre' wurde als Hauptgrundlage ausgewählt und weiterentwickelt. Verschiedene Sortieralgorithmen wurden einem Benchmarking unterzogen, um ihr Verhalten hinsichtlich des Energieverbrauchs zu untersuchen. Der resultierende Datensatz wurde auch zur Beantwortung der eingangs gestellten grundlegenden Fragen verwendet. Ein Replikations- und Reproduktionspaket wurde bereitgestellt, um die Reproduzierbarkeit der Ergebnisse zu ermöglichen.
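To illustrate the kind of software-based measurement discussed above, here is a minimal sketch that is not Scaphandre's API but a plain reading of the Linux powercap/RAPL energy counter (assuming an Intel system where /sys/class/powercap/intel-rapl:0/energy_uj is readable, which may require elevated privileges) around a sorting benchmark:

    import random
    import time

    RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0 energy counter, microjoules

    def read_energy_uj() -> int:
        with open(RAPL) as f:
            return int(f.read().strip())

    def measure(func, *args):
        """Return (result, joules, seconds); ignores counter wrap-around for brevity."""
        e0, t0 = read_energy_uj(), time.perf_counter()
        result = func(*args)
        e1, t1 = read_energy_uj(), time.perf_counter()
        return result, (e1 - e0) / 1e6, t1 - t0

    data = [random.random() for _ in range(1_000_000)]
    _, joules, seconds = measure(sorted, data)
    print(f"sorted 1e6 floats: {joules:.3f} J in {seconds:.3f} s")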
- Published
- 2021
28. Entwicklung und Evaluation der Darstellung von Testabdeckungen in Getaviz
- Author
-
Sillus, Aaron, Eisenecker, Ulrich, Kovacs, Pascal, and Universität Leipzig
- Subjects
Softwareevaluation ,Usability Engineering ,Getaviz ,ddc:000 ,Softwarevisualisierung ,Testabdeckung ,Softwareentwicklung - Abstract
Software visualization uses, among other things, three-dimensional models to represent software. These models allow software projects to be explored by interacting with a 3D scene. As part of its research in this area, the Institut für Wirtschaftsinformatik of Universität Leipzig develops the tool Getaviz, which comprises various functions to support the analysis of software. In the present thesis, an extension for displaying test coverage in Getaviz is developed. Techniques from usability engineering are used to achieve a high level of usability. In particular, the development process takes place in several iterations, in which the design is evaluated in a formative study and adapted for the next iteration. The development process and the final state are also documented as a repository on GitHub (https://github.com/AaronSil/Getaviz/tree/development).:Table of contents List of figures List of tables 1 Introduction 1.1 Motivation and problem statement 1.2 Goal and structure of the thesis 2 Fundamentals 2.1 Software visualization 2.2 Getaviz 2.3 Test coverage 2.4 Test coverage in software visualization 2.5 Usability engineering 3 Design of the prototype 3.1 Approach 3.2 Requirements analysis 3.2.1 Scope delimitation 3.2.2 Functional requirements 3.2.3 Non-functional requirements 3.2.4 Objectives for the first iteration 4 Design of the evaluation 4.1 Object and design of the study 4.2 Methods 4.3 Test design 4.3.1 Preparation and setup 4.3.2 Execution 4.3.3 Follow-up 5 Conduct of the evaluation 5.1 Sample composition 5.2 First iteration 5.3 Second iteration 5.4 Third iteration 6 Implementation of the prototype 6.1 Extension of the generator 6.2 Source code controller 6.3 Treemap 6.4 Spheres 6.5 Color coding 6.6 Color controller 6.7 Experiment popover window 6.8 Tooltip controller 6.9 Package Explorer 6.10 Other features 7 Study results 7.1 Categorization of the results 7.2 Interpretation of the results 7.3 Discussion 8 Conclusion and outlook Bibliography Declaration of authorship Appendix Appendix 1: Questionnaire Appendix 2: Interview guide Appendix 3: Key data of the iterations Appendix 4: Scenarios Appendix 5: Implementation of the prototype Appendix 6: Evaluation - Questionnaire Appendix 7: Evaluation - Findings
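As a small, purely hypothetical illustration of the color-coding feature listed in the outline above (not taken from the Getaviz code base), a test-coverage ratio could be mapped to a red-to-green hex color for use in a treemap or 3D scene:

    def coverage_color(covered: int, total: int) -> str:
        """Map a coverage ratio to a hex color from red (0%) to green (100%)."""
        ratio = 0.0 if total == 0 else covered / total
        red = round(255 * (1.0 - ratio))
        green = round(255 * ratio)
        return f"#{red:02x}{green:02x}40"  # blue channel fixed for readability

    # usage: color a class that has 37 of 50 lines covered
    print(coverage_color(37, 50))  # '#42bd40'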
- Published
- 2021
29. Optimierung der Visualisierung eines Dashboards für das Microservice-Monitoring
- Author
-
Urban, Dario and Universität Leipzig
- Subjects
ddc:000 ,Microservices, Microservice-Monitoring, Monitoring-Dashboard, Linkerd - Abstract
Microservice architectures have become established and are being adopted by more and more companies. However, the increased complexity of a microservice architecture, caused by the distribution of the system, makes efficient and successful administration of the system more difficult. The goal of this thesis is to identify all metrics necessary for efficient and successful microservice monitoring and, based on these findings, to extend the Linkerd dashboard as a prototype. To this end, a literature review was conducted. In addition, practitioners were surveyed by means of an online questionnaire. Finally, the prototype extension was evaluated using a semi-structured interview. The literature review showed that central processing unit (CPU) and random access memory (RAM) usage, response time, workload, error rate, and service interaction form a set of metrics with which microservice architectures can be monitored effectively. It was also found that these metrics are mainly presented using visualizations. CPU and RAM usage are a sensible extension of the Linkerd dashboard, since they are declared important indicators in the literature as well as in the questionnaire, and all other metrics classified as essential are already covered by the Linkerd dashboard. The prototype was judged to be successful, but requires some minor improvements to the visualization before it can be used in production.
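A minimal sketch of how the metric set identified above could be sampled for a dashboard prototype; it assumes the third-party psutil package and a caller-supplied health probe, and it is not part of Linkerd:

    import time
    import psutil  # third-party: pip install psutil

    def sample_metrics(probe):
        """Collect a minimal metric set for a monitoring dashboard."""
        start = time.perf_counter()
        ok = probe()                       # e.g. an HTTP health check returning True/False
        response_time_ms = (time.perf_counter() - start) * 1000
        return {
            "cpu_percent": psutil.cpu_percent(interval=0.1),
            "ram_percent": psutil.virtual_memory().percent,
            "response_time_ms": round(response_time_ms, 1),
            "error": not ok,
        }

    print(sample_metrics(lambda: True))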
- Published
- 2021
30. Topology of Uncertain Scalar Fields
- Author
-
Liebmann, Tom, Scheuermann, Gerik, Theisel, Holger, and Universität Leipzig
- Subjects
ddc:000 ,scalar fields, topology, visualization ,Skalarfelder, Topologie, Visualisierung - Abstract
Scalar fields are used in many disciplines to represent scalar quantities over some spatial domain. Their versatility and the potential to model a variety of real-world phenomena has made scalar fields a key part of modern data analysis. Examples range from modeling scan results in medical applications (e.g. Magnetic Resonance Imaging or Computer Tomography), measurements and simulations in climate and weather research, or failure criteria in material sciences. But one thing that all applications have in common is that the data is always affected by errors. In measurements, potential error sources include sensor inaccuracies, an unevenly sampled domain, or unknown external influences. In simulations, common sources of error are differences between the model and the simulated phenomenon or numerical inaccuracies. To incorporate these errors into the analysis process, the data model can be extended to include uncertainty. For uncertain scalar fields that means replacing the single value that is given at every location with a value distribution. While in some applications, the influence of uncertainty might be small, there are a lot of cases where variations in the data can have a large impact on the results. A typical example is weather forecasts, where uncertainty is a crucial part of the data analysis. With increasing access to large sensor networks and extensive simulations, the complexity of scalar fields often grows to a point that makes analysis of the raw data unfeasible. In such cases, topological analysis has proven to be a useful tool for reducing scalar fields to their fundamental properties. Scalar field topology studies structures that do not change under transformations like scaling and bending but only depend on the connectivity and relative value differences between the points of the domain. While a lot of research has been done in this area for deterministic scalar fields, the incorporation of uncertainty into topological methods has only gained little attention so far. In this thesis, several methods are introduced that deal with the topological analysis of uncertain scalar fields. The main focus lies on providing fundamental research on the topic and to drive forward a rigorous analysis of the influence of uncertainty on topological properties. One important property that has a strong influence on topological features are stochastic dependencies between different locations in the uncertain scalar field. In the first part of this thesis, we provide a method for extracting regions that show linear dependency, i.e. correlation. Using a combination of point-cloud density estimation, clustering, and scalar field topology, our method extracts a hierarchical clustering. Together with an interactive visualization, the user can explore the correlation information and select and filter the results. A major benefit of our approach is the comprehensive handling of correlation. This also includes global correlation between distant points and inverse correlation, which is often only partially handled by existing methods. The second part of this thesis focuses on the extraction of topological features, such as critical points or hills and valleys of the scalar field. We provide a method for extracting critical points in uncertain scalar fields and track them over multiple realizations. Using a novel approach that operates in the space of all possible realizations, our method can find all critical points deterministically. 
This not only increases the reliability of the results but also provides complete knowledge that can be used to study the relation and behavior of critical points across different realizations. Through a combination of multiple views, we provide a visualization that can be used to analyze critical points of an uncertain scalar field for real-world data. In the last part, we further extend our analysis to more complex feature types. Based on the well-known contour tree that provides an abstract view on the topology of a deterministic scalar field, we use an approach that is similar to our critical point analysis to extract and track entire regions of the uncertain scalar field. This requires solving a series of new challenges that are associated with tracking features in the multi-dimensional space of all realizations. As our research on the topic falls under the category of fundamental research, there are still some limitations that have to be overcome in the future. However, we provide a full pipeline for extracting topological features that ranges from the data model to the final interactive visualization. We further show the applicability of our methods to synthetic and real-world data. Skalarfelder sind Funktionen, die jedem Punkt eines Raumes einen skalaren Wert zuweisen. Sie werden in vielen verschiedenen Bereichen zur Analyse von skalaren Messgrößen mit räumlicher Information eingesetzt. Ihre Flexibilität und die Möglichkeit, viele unterschiedliche Phänomene der realen Welt abzubilden, macht Skalarfelder zu einem wichtigen Werkzeug der modernen Datenanalyse. Beispiele reichen von medizinischen Anwendungen (z.B. Magnetresonanztomographie oder Computertomographie) über Messungen und Simulationen in Klima- und Wetterforschung bis hin zu Versagenskriterien in der Materialforschung. Eine Gemeinsamkeit all dieser Anwendungen ist jedoch, dass die erfassten Daten immer von Fehlern beeinflusst werden. Häufige Fehlerquellen in Messungen sind Sensorungenauigkeiten, ein ungleichmäßig abgetasteter Betrachtungsbereich oder unbekannte externe Einflussfaktoren. Aber auch Simulationen sind von Fehlern, wie Modellierungsfehlern oder numerischen Ungenauigkeiten betroffen. Um die Fehlerbetrachtung in die Datenanalyse einfließen lassen zu können, ist eine Erweiterung des zugrunde liegenden Datenmodells auf sogenannte \emph{unsicheren Daten} notwendig. Im Falle unsicherer Skalarfelder wird hierbei statt eines festen skalaren Wertes für jeden Punkt des Definitionsbereiches eine Werteverteilung angegeben, die die Variation der Skalarwerte modelliert. Während in einigen Anwendungen der Einfluss von Unsicherheit vernachlässigbar klein sein kann, gibt es viele Bereiche, in denen Schwankungen in den Daten große Auswirkungen auf die Resultate haben. Ein typisches Beispiel sind hierbei Wettervorhersagen, bei denen die Vertrauenswürdigkeit und mögliche alternative Ausgänge ein wichtiger Bestandteil der Analyse sind. Die ständig steigende Größe verfügbarer Sensornetzwerke und immer komplexere Simulationen machen es zunehmend schwierig, Daten in ihrer rohen Form zu verarbeiten oder zu speichern. Daher ist es wichtig, die verfügbare Datenmenge durch Vorverarbeitung auf für die jeweilige Anwendung relevante Merkmale zu reduzieren. Topologische Analyse hat sich hierbei als nützliches Mittel zur Verarbeitung von Skalarfeldern etabliert. Die Topologie eines Skalarfeldes umfasst all jene Merkmale, die sich unter bestimmten Transformationen, wie Skalierung und Verzerrung des Definitionsbereiches, nicht verändern. 
Hierzu zählen beispielsweise die Konnektivität des Definitionsbereiches oder auch die Anzahl und Beziehung von Minima und Maxima. Während die Topologie deterministischer Skalarfelder ein gut erforschtes Gebiet ist, gibt es im Bereich der Verarbeitung von Unsicherheit im topologischen Kontext noch viel Forschungspotenzial. In dieser Dissertation werden einige neue Methoden zur topologischen Analyse von unsicheren Skalarfeldern vorgestellt. Der wesentliche Teil dieser Arbeit ist hierbei im Bereich der Grundlagenforschung angesiedelt, da er sich mit der theoretischen und möglichst verlustfreien Verarbeitung von topologischen Strukturen befasst. Eine wichtige Eigenschaft, die einen starken Einfluss auf die Struktur eines unsicheren Skalarfeldes hat, ist die stochastische Abhängigkeit zwischen verschiedenen Punkten. Im ersten Teil dieser Dissertation wird daher ein Verfahren vorgestellt, das das unsichere Skalarfeld auf Regionen mit starker linearer Abhängigkeit, auch \emph{Korrelation} genannt, untersucht. Durch eine Kombination aus hochdimensionaler Punktwolkenanalyse, Clusterbildung und Skalarfeldtopologie extrahiert unsere Methode eine Hierarchie von Clustern, die die Korrelation des unsicheren Skalarfeldes repräsentiert. Zusammen mit einer interaktiven, visuellen Aufbereitung der Daten wird dem Nutzer so ein explorativer Ansatz zur Betrachtung der stochastischen Abhängigkeiten geboten. Anzumerken ist hierbei, dass unser Verfahren auch globale und inverse Korrelation abdeckt, welche in vielen verwandten Arbeiten oft nicht vollständig behandelt werden. Der zweite Teil dieser Dissertation widmet sich der Analyse und Extraktion von topologischen Merkmalen, wie kritischen Punkten oder ganzen Hügeln oder Tälern im Funktionsgraphen des Skalarfeldes. Hierzu wird ein Verfahren zur Berechnung von kritischen Punkten vorgestellt, das diese auch über viele verschiedene Realisierungen des unsicheren Skalarfeldes identifizieren und verfolgen kann. Dies wird durch einen neuen Ansatz ermöglicht, der den Raum aller möglichen Realisierungen nach geometrischen Strukturen untersucht und somit kritische Punkte deterministisch berechnen kann. Dadurch, dass mit diesem Verfahren keine kritischen Punkte ausgelassen werden, steigt nicht nur die Vertrauenswürdigkeit der Resultate, sondern es wird außerdem möglich, Beziehungen zwischen kritischen Punkten zu untersuchen. Zu diesen Beziehungen gehört beispielsweise das Wandern von kritischen Punkten über verschiedene Positionen oder auch die Entstehung von Skalarwerthügeln oder -tälern. Um die Resultate visuell zu untersuchen, stellen wir mehrere verknüpfte Ansichten bereit, die eine Analyse von kritischen Punkten auch in realen Daten ermöglichen. Im letzten Teil dieser Arbeit erweitern wir die Betrachtung der Topologie von kri\-ti\-schen Punkten auf komplexere Strukturen. Basierend auf dem \emph{Konturbaum}, der eine abstrakte Repräsentation der Topologie eines deterministischen Skalarfeldes ermöglicht, untersuchen wir, wie ganze Regionen des Skalarfeldes von Unsicherheit betroffen sind. Dies führt zu einer Reihe von neuen theoretischen und auch praktischen Herausforderungen, wie der stark steigenden Komplexität der notwendigen Berechnungen oder Inkonsistenzen bei der Verfolgung von topologischen Strukturen über mehrere Realisierungen. Auch wenn zur Anwendung unserer Verfahren auf reale Daten aufgrund des großen Möglichkeitsraumes von unsicheren Skalarfeldern noch Einschränkungen notwendig sind, sind viele der theoretischen Erkenntnisse allgemeingültig. 
Zur Betrachtung der Ergebnisse werden verschiedene Visualisierungen genutzt, um die extrahierten topologischen Strukturen anhand von synthetischen und realen Daten zu zeigen.
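The thesis extracts critical points deterministically in the space of all realizations; as a much simpler point of reference (my own sketch, assuming independent per-cell Gaussians and thus ignoring the spatial correlation discussed above), a Monte Carlo estimate of how often each grid cell of an uncertain 2D scalar field is a local maximum can be computed as follows:

    import numpy as np

    def local_max_mask(field: np.ndarray) -> np.ndarray:
        """True where a cell is strictly larger than its 4-neighbours (interior only)."""
        m = np.zeros_like(field, dtype=bool)
        c = field[1:-1, 1:-1]
        m[1:-1, 1:-1] = (
            (c > field[:-2, 1:-1]) & (c > field[2:, 1:-1]) &
            (c > field[1:-1, :-2]) & (c > field[1:-1, 2:])
        )
        return m

    def max_probability(mean: np.ndarray, std: np.ndarray, n_samples=500, seed=0):
        """Monte Carlo estimate of P(cell is a local maximum) under independent Gaussians."""
        rng = np.random.default_rng(seed)
        counts = np.zeros(mean.shape)
        for _ in range(n_samples):
            counts += local_max_mask(rng.normal(mean, std))
        return counts / n_samples

    y, x = np.mgrid[0:32, 0:32]
    mean = np.exp(-((x - 16) ** 2 + (y - 16) ** 2) / 40.0)   # one smooth hill
    prob = max_probability(mean, std=np.full(mean.shape, 0.05))
    print(prob.max(), np.unravel_index(prob.argmax(), prob.shape))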
- Published
- 2020
31. Computational Characterization of Long Non-Coding RNAs
- Author
-
Sen, Rituparno and Universität Leipzig
- Subjects
ddc:000 ,lncRNAs, NGS, annotation, lncRNA function, machine learning - Abstract
In a cell, the DNA undergoes transcription to form mature transcripts, some of which in turn undergo translation to form proteins. Although over 85% of the human genome is transcribed, it comprises only about 2% protein-coding genes, the rest being noncoding. One of the non-coding gene elements, called long non-coding RNAs (lncRNAs), are emerging as key players in various regulatory roles in the human genome. The generally accepted theory posits lncRNAs to be over 200 nucleotides long and to be able to grow over 10 kilobases, bearing a similarity with mRNAs. The majority of lncRNAs undergo alternative splicing and are weakly polyadenylated in combination with complex secondary structures. Among the annotated lncRNAs, so far it has been only a meagre portion for which functional roles have been detected, while functions of the vast majority remain to be discovered. Observed functional roles include thus far gene expression regulation through various mechanisms at transcriptional and post-transcriptional levels. With the advent of next-generation sequencing (NGS) and advances in RNA sequencing technology (RNA-Seq), it is easier to reconstruct the transcriptome by extracting information about the splicing machinery. RNA-Seq has helped consortia like GENCODE, ENCODE, and others to curate their annotation catalogues. In this PhD thesis, certain aspects of the human lncRNA transcriptome will be explored, such as the challenges in lncRNA annotation. Those challenges stem from the lack of signals that are common in mRNAs and make them easier to detect, for instance signals of ORFs and transcription start sites. Concurrently, owing to a lack of understanding of the connection between sequence and function, lncRNAs have been typically annotated based upon their location in relation to mRNAs and their functions have been predicted through a guilt-by-association approach. In the first part of the PhD research work, the splice junctions in the lncRNA transcriptome were mapped in an attempt to explore the isoform diversity of lncRNAs by using sequencing data from B-cell lymphoma. In this phase of the research work, multiple junction-spanning reads from the sequencing data with a very large read depth were found to represent the splice junctions. Using GENCODE v19 as a reference it was found that the human transcriptome harbours a large number of rare exons and introns that have remained unannotated. Concomitantly, it can be inferred that the current human transcriptome annotation is confined to a very well-defined set of splice variants. However, although the isoforms are well-defined, the same cannot be said about their biological functions and it remains to be explored why the processing machinery of lncRNAs is restricted to a set of very few splice sites. In the human genome, small regulatory RNAs like miRNAs and small nucleolar RNAs (snoRNAs) overlap with lncRNAs in their genomic loci. To further understand the human transcriptome, in the second part of the PhD research work, a study was undertaken in an attempt to distinguish the miRNA and snoRNA hosting lncRNAs from the lncRNAs that did not have any overlaps with the smaller RNAs. To this end, machine learning techniques were implemented on curated datasets employing features inspired by a few of the prevalent features used in published lncRNA detection tools encompassing not just sequence information, but also secondary structure and conservation information. 
Classification was attempted through supervised as well as unsupervised learning approaches; random forests for the former, PCA and k-means for the latter. In the end, the three RNA classes could not be separated with certitude, especially when the hosted RNA was not supplied to the classifier, however, this lack of detectable association can be confirmed to be of biological interest. It suggests that the function of host genes is not closely tied to the function of the hosted genes at least in this case. Nevertheless, understanding the dynamics of snoRNA and miRNA host genes can improve the knowledge of functional evolution of lncRNAs, as the fact that the smaller RNA genes are conserved makes it comparably easier to trace the host lncRNAs over much larger evolutionary timescales than most other lncRNAs. With the accelerated availability of sequencing techniques it can be expected that expanded investigation into conservation patterns and host gene functions will be possible in the near future.
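A toy sketch of the supervised setting described above (assuming scikit-learn; the feature set is heavily simplified compared to the thesis, which also used secondary structure and conservation information): k-mer frequencies and GC content feed a random forest that tries to separate host from non-host lncRNAs. The data below is random and only demonstrates the workflow.

    from itertools import product
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

    def features(seq: str) -> list[float]:
        seq = seq.upper()
        gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
        counts = [seq.count(k) / max(len(seq) - 2, 1) for k in KMERS]  # non-overlapping counts
        return [gc] + counts

    # toy data: sequences labelled 1 (snoRNA/miRNA host lncRNA) or 0 (non-host lncRNA)
    rng = np.random.default_rng(0)
    seqs = ["".join(rng.choice(list("ACGT"), 300)) for _ in range(60)]
    labels = rng.integers(0, 2, size=60)

    X = np.array([features(s) for s in seqs])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print(cross_val_score(clf, X, labels, cv=5).mean())  # ~0.5 on this random toy data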
- Published
- 2020
32. Evolution of DNA methylation across Metazoa
- Author
-
Engelhardt, Jan and Universität Leipzig
- Subjects
DNA methylation, evolution, DNA methyltransferases, Ecdysozoa, Vertebrata, zebrafish ,ddc:000 - Abstract
DNA methylation is a crucial, abundant mechanism of gene regulation in vertebrates. It is less prevalent in many other metazoan organisms and completely absent in some key model species, such as D. melanogaster and C. elegans. In this thesis we report on a comprehensive study of the presence and absence of DNA methyltransferases (DNMTs) in 138 Ecdysozoa covering Arthropoda, Nematoda, Priapulida, Onychophora, and Tardigrada. We observe that loss of individual DNMTs independently occurred multiple times across ecdysozoan phyla. In several cases, this resulted in a loss of DNA methylation. In vertebrates, however, there is no single species known which lost DNA methylation. Actually, DNA methylation was greatly expanded after the 1R/2R whole genome duplication (WGD) and became a genome-wide phenomenon. In our study of vertebrates we are not looking for losses of DNA methyltransferases and DNA methylation but are rather interested in the gain of additional DNA methyltransferase genes. In vertebrates there were a number of WGDs. Most vertebrates only underwent two WGDs, but in the teleost lineage a third round of WGD occurred, and in some groups, e.g. Salmoniformes and some Cypriniformes, even a fourth WGD occurred. The carp-specific WGD (4R) is one of the most recent vertebrate WGDs and is estimated to have occurred 12.4 mya. We performed the most comprehensive analysis of the evolution of DNA methyltransferases after vertebrate whole-genome duplications (WGD) so far. We were able to show that the conservation of duplicated DNMT3 genes in Salmoniformes is more diverse than previously believed. We were also able to identify DNA methyltransferases in Cypriniformes which have, due to their recent WGD, quite complex genomes. Our results show that the patterns of retained and lost DNA methyltransferases after a fourth round of WGD differ between Cypriniformes and Salmoniformes. We also proposed a new nomenclature for teleost DNMT genes which correctly represents the orthology of DNMT genes for all teleost species. Next to these purely computational projects, we collaborated with the Aluru lab to investigate the effects of different disturbances on zebrafish DNA methylation. One disturbance is the inactivation of DNMT3aa and DNMT3ab as single knockouts as well as a double knockout. This was the first double knockout of DNMT genes in zebrafish ever generated. It allows us to study the subfunctionalization of the two DNMT3a genes and their effect on genome-wide DNA methylation. Given our results, we hypothesize that DNMT3aa and DNMT3ab can compensate for each other to a high degree. DNMT3a genes have likely been subfunctionalized, but their loss can be compensated by DNMT3b genes. This compensation by DNMT3b genes works well enough that no notable phenotype can be observed in double knockout zebrafish, but a difference is notable on the epigenome level. The second disturbance we studied is the exposure of zebrafish to the toxic chemical PCB126. We detected a moderate level of DNA methylation changes and a much larger effect on gene expression. Similar to previous reports, we find little correlation between DNA methylation and gene expression changes. Therefore, while PCB126 exposure has a negative effect on DNA methylation, it is likely that other gene regulatory mechanisms play a role as well, possibly even a greater one. How genes evolve and how genes are regulated are two of the main questions of modern molecular biology. In this thesis we have tried to shed more light on both questions.
We have broadly expanded the phylogenetic range of species with a manually curated set of DNA methyltransferases. We have done this for ecdysozoan species which have lost all DNA methylating enzymes as well as for teleost fish which acquired more than ten copies of the originally two genes. We were also able to generate new insight into the subfunctionalization of the DNA methylation machinery in zebrafish and how it reacts to environmental effects.:1 Introduction 1.1 Biological introduction 1.2 Detecting DNA methylation 2 Evolution of DNA methylation across Ecdysozoa 2.1 Introduction 2.2 Methods 2.3 Results 2.4 Discussion 3 Evolution of DNA methyltransferases after vertebrate whole genome duplications 3.1 Introduction 3.2 Methods 3.3 Results 3.4 Discussion 4 The effect of DNMT3aa and DNMT3ab knockout on DNA methylation in zebrafish 4.1 Introduction 4.2 Methods 4.3 Results 4.4 Discussion 5 Role of DNA methylation in altered testis gene expression patterns in adult zebrafish exposed to Pentachlorobiphenyl 5.1 Introduction 5.2 Methods 5.3 Results 5.4 Discussion 6 Conclusions 6.1 Evolution of DNA methylation across Ecdysozoa 6.2 Evolution of DNA methyltransferases after vertebrate whole genome duplications 6.3 Role of DNA methylation in altered testis gene expression patterns in adult zebrafish (Danio rerio) exposed to Pentachlorobiphenyl (PCB 126) 6.4 Knockout of DNMT3aa and DNMT3ab in zebrafish (Danio rerio) Bibliography
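As a hedged illustration of the kind of per-CpG comparison that underlies the knockout analyses summarized above (a sketch assuming SciPy, not the actual pipeline used in the thesis), methylated and unmethylated read counts from wild type and knockout can be compared with a Fisher's exact test:

    from scipy.stats import fisher_exact

    def diff_methylation(wt_meth, wt_unmeth, ko_meth, ko_unmeth):
        """Per-CpG comparison of methylation between wild type (wt) and knockout (ko)."""
        table = [[wt_meth, wt_unmeth], [ko_meth, ko_unmeth]]
        odds, p = fisher_exact(table)
        wt_level = wt_meth / (wt_meth + wt_unmeth)
        ko_level = ko_meth / (ko_meth + ko_unmeth)
        return wt_level - ko_level, p

    # usage: one CpG covered by 30 reads in each sample
    delta, p = diff_methylation(wt_meth=25, wt_unmeth=5, ko_meth=12, ko_unmeth=18)
    print(f"methylation difference {delta:.2f}, p = {p:.4f}")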
- Published
- 2020
33. Computational Gene Expression Deconvolution
- Author
-
Otto, Dominik and Universität Leipzig
- Subjects
Bayesian Model, Variational Inference, Dimensional Reduction, Immune Cell Characterization, Prostate Carcinoma ,ddc:000 - Abstract
Technologies such as micro-expression arrays and high-throughput sequencing assays have accelerated research of genetic transcription in biological cells. Furthermore, many links between the gene expression levels and the phenotypic characteristics of cells have been discovered. Our current understanding of transcriptomics as an intermediate regulatory layer between genomics and proteomics raises hope that we will soon be able to decipher many more cellular mechanisms through the exploration of gene transcription. However, although large amounts of expression data are measured, only limited information can be extracted. One general problem is the large set of considered genomic features. Expression levels are often analyzed individually because of limited computational resources and unknown statistical dependencies among the features. This leads to multiple testing issues or can lead to overfitting models, commonly referred to as the “curse of dimensionality.” Another problem can arise from ignorance of measurement uncertainty. In particular, approaches that consider statistical significance can suffer from underestimating uncertainty for weakly expressed genes and consequently require subjective manual measures to produce consistent results (e.g., domain-specific gene filters). In this thesis, we lay out a theoretical foundation for a Bayesian interpretation of gene expression data based on subtle assumptions. Expression measurements are related to latent information (e.g., the transcriptome composition), which we formulate as a probability distribution that represents the uncertainty over the composition of the original sample. Instead of analyzing univariate gene expression levels, we use the multivariate transcriptome composition space. To realize computational feasibility, we develop a scalable dimensional reduction that aims to produce the best approximation that can be used with the computational resources available. To enable the deconvolution of gene expression, we describe subtissue specific probability distributions of expression profiles.
We demonstrate the suitabil- ity of our approach with two deconvolution applications: first, we infer the composition of immune cells, and second we reconstruct tumor-specific ex- pression patterns from bulk-RNA-seq data of prostate tumor tissue samples.:1 Introduction 1 1.1 State of the Art and Motivation 2 1.2 Scope of this Thesis 5 2 Notation and Abbreviations 7 2.1 Notations 7 2.2 Abbreviations 9 3 Methods 10 3.1 The Convolution Assumption 10 3.2 Principal Component Analysis 11 3.3 Expression Patterns 11 3.4 Bayes’ Theorem 12 3.5 Inference Algorithms 13 3.5.1 Inference Through Sampling 13 3.5.2 Variationa lInference 14 4 Prior and Conditional Probabilities 16 4.1 Mixture Coefficients 16 4.2 Distribution of Tumor Cell Content 18 4.2.1 Optimal Tumor Cell Content Drawing 20 4.3 Transcriptome Composition Distribution 21 4.3.1 Sequencing Read Distribution 21 4.3.1.1 Empirical Plausibility Investigation 25 4.3.2 Dirichletand Normality 29 4.3.3 Theta◦logTransformation 29 4.3.4 Variance Stabilization 32 4.4 Cell and Tissue-Type-Specific Expression Pattern Distributions 32 4.4.1 Method of Moments and Factor Analysis 33 4.4.1.1 Tumor Free Cells 33 4.4.1.2 Tumor Cells 34 4.4.2 Characteristic Function 34 4.4.3 Gaussian Mixture Model 37 4.5 Prior Covariance Matrix Distribution 37 4.6 Bayesian Survival Analysis 38 4.7 Demarcation from Existing Methods 40 4.7.1 Negative Binomial Distribution 40 4.7.2 Steady State Assumption 41 4.7.3 Partial Correlation 41 4.7.4 Interaction Networks 42 5 Feasibility via Dimensional Reduction 43 5.1 DR for Deconvolution of Expression Patterns 44 5.1.1 Systematically Differential Expression 45 5.1.2 Internal Distortion 46 5.1.3 Choosinga DR 46 5.1.4 Testing the DR 47 5.2 Transformed Density Functions 49 5.3 Probability Distribution of Mixtures in DR Space 50 5.3.1 Likelihood Gradient 51 5.3.2 The Theorem 52 5.3.3 Implementation 52 5.4 DR for Inference of Cell Composition 53 5.4.1 Problem Formalization 53 5.4.2 Naive PCA 54 5.4.3 Whitening 55 5.4.3.1 Covariance Inflation 56 5.4.4 DR Through Optimization 56 5.4.4.1 Starting Point 57 5.4.4.2 The Optimization Process 58 5.4.5 Results 59 5.5 Interpretation of DR 61 5.6 Comparison to Other DRs 62 5.6.1 Weighted Correlation Network Analysis 62 5.6.2 t-Distributed Stochastic Neighbor Embedding 65 5.6.3 Diffusion Map 66 5.6.4 Non-negativeMatrix Factorization 66 5.7 Conclusion 67 6 Data for Example Application 68 6.1 Immune Cell Data 68 6.1.1 Provided List of Publicly Available Data 68 6.1.2 Obtaining the Publicly Available RNA-seq Data 69 6.1.3 Obtaining the Publicly Available Expression Microarray Data 71 6.1.4 Data Sanitization 71 6.1.4.1 A Tagging Tool 72 6.1.4.2 Tagging Results 73 6.1.4.3 Automatic Sanitization 74 6.1.5 Data Unification 75 6.1.5.1 Feature Mapping 76 6.1.5.2 Feature Selection 76 6.2 Examples of Mixtures with Gold Standard 79 6.2.1 Expression Microarray Data 81 6.2.2 Normalized Expression 81 6.2.3 Composition of the Gold Standard 82 6.3 Tumor Expression Data 82 6.3.1 Tumor Content 82 6.4 Benchmark Reference Study 83 6.4.1 Methodology 83 6.4.2 Reproduction 84 6.4.3 Reference Hazard Model 85 7 Bayesian Models in Example Applications 87 7.1 Inference of Cell Composition 87 7.1.1 The Expression Pattern Distributions (EPDs) 88 7.1.2 The Complete Model 89 7.1.3 Start Values 89 7.1.4 Resource Limits 90 7.2 Deconvolution of Expression Patterns 91 7.2.1 The Distribution of Expression Pattern Distribution 91 7.2.2 The Complete Model 92 7.2.3 SingleSampleDeconvolution 93 7.2.4 A Simplification 94 7.2.5 Start Values 94 8 
Results of Example Applications 96 8.1 Inference of Cell Composition 96 8.1.1 Single Composition Output 96 8.1.2 ELBO Convergence in Variational Inference 97 8.1.3 Difficulty-Divergence 97 8.1.3.1 Implementing an Alternative Stick-Breaking 98 8.1.3.2 Using MoreGeneral Inference Methods 99 8.1.3.3 UsingBetterData 100 8.1.3.4 Restriction of Variance of Cell-Type-Specific EPDs 100 8.1.3.5 Doing Fewer Iterations 100 8.1.4 Difficulty-Bias 101 8.1.5 Comparison to Gold Standard 101 8.1.6 Comparison to Competitors 101 8.1.6.1 Submission-Aginome-XMU 105 8.1.6.2 Submission-Biogem 105 8.1.6.3 Submission-DA505 105 8.1.6.4 Submission-AboensisIV 105 8.1.6.5 Submission-mittenTDC19 106 8.1.6.6 Submission-CancerDecon 106 8.1.6.7 Submission-CCB 106 8.1.6.8 Submission-D3Team 106 8.1.6.9 Submission-ICTD 106 8.1.6.10 Submission-Patrick 107 8.1.6.11 Conclusion for the Competitor Review 107 8.1.7 Implementation 107 8.1.8 Conclusion 108 8.2 Deconvolution of Expression Patterns 108 8.2.1 Difficulty-Multimodality 109 8.2.1.1 Order of Kernels 109 8.2.1.2 Posterior EPD Complexity 110 8.2.1.3 Tumor Cell Content Estimate 110 8.2.2 Difficulty-Time 110 8.2.3 The Inference Process 111 8.2.3.1 ELBO Convergence in Variational Inference 111 8.2.4 Posterior of Tumor Cell Content 112 8.2.5 Posterior of Tissue Specific Expression 112 8.2.6 PosteriorHazardModel 113 8.2.7 Gene Marker Study with Deconvoluted Tumor Expression 115 8.2.8 Hazard Model Comparison Overview 116 8.2.9 Implementation 116 9 Discussion 117 9.1 Limitations 117 9.1.1 Simplifying Assumptions 117 9.1.2 Computation Resources 118 9.1.3 Limited Data and Suboptimal Format 118 9.1.4 ItIsJustConsistency 119 9.1.5 ADVI Uncertainty Estimation 119 9.2 Outlook 119 9.3 Conclusion 121 A Appendix 123 A.1 Optimalα 123 A.2 Digamma Function and Logarithm 123 A.3 Common Normalization 124 A.3.1 CPMNormalization 124 A.3.2 TPMNormalization 124 A.3.3 VSTNormalization 125 A.3.4 PCA After Different Normalizations 125 A.4 Mixture Prior Per Tissue Source 125 A.5 Data 125 A.6 Cell Type Characterization without Whitening 133 B Proofs 137 Bibliography 140
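The thesis develops a full Bayesian model solved with variational inference; as a much simpler reference point (my own sketch, not the proposed method), the underlying convolution assumption — bulk expression as a non-negative weighted sum of cell-type-specific profiles — can be illustrated with a non-negative least-squares fit on synthetic data:

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(1)
    n_genes, n_cell_types = 500, 4

    signatures = rng.gamma(2.0, 1.0, size=(n_genes, n_cell_types))  # reference profiles
    true_fracs = np.array([0.5, 0.3, 0.15, 0.05])
    bulk = signatures @ true_fracs + rng.normal(0, 0.05, n_genes)   # noisy mixture

    coeffs, _ = nnls(signatures, bulk)          # convolution assumption: bulk ≈ S @ fractions
    fractions = coeffs / coeffs.sum()           # normalize to proportions
    print(np.round(fractions, 3))               # close to the true fractions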
- Published
- 2020
34. Developing and Utilizing the Concept of Affine Linear Neighborhoods in Flow Visualization
- Author
-
Koch, Stefan, Scheuermann, Gerik, Sadlo, Filip, Hlawitschka, Mario, and Universität Leipzig
- Subjects
flow visualization, vector field visualization, vector field topology, linear approximation ,ddc:000 ,Strömungsvisualisierung, Vektorfeldvisualisierung, Vektorfeldtopologie, Lineare Approximation - Abstract
In vielen Forschungsbereichen wie Medizin, Natur- oder Ingenieurwissenschaften spielt die wissenschaftliche Visualisierung eine wichtige Rolle und hilft Wissenschaftlern neue Erkenntnisse zu gewinnen. Der Hauptgrund hierfür ist, dass Visualisierungen das Unsichtbare sichtbar machen können. So können Visualisierungen beispielsweise den Verlauf von Nervenfasern im Gehirn von Probanden oder den Luftstrom um Hindernisse herum darstellen. Diese Arbeit trägt insbesondere zum Teilgebiet der Strömungsvisualisierung bei, welche sich mit der Untersuchung von Prozessen in Flüssigkeiten und Gasen beschäftigt. Eine beliebte Methode, um Einblicke in komplexe Datensätze zu erhalten, besteht darin, einfache und bekannte Strukturen innerhalb eines Datensatzes aufzuspüren. In der Strömungsvisualisierung führt dies zum Konzept der lokalen Linearisierung und Linearität im Allgemeinen. Dies liegt daran, dass lineare Vektorfelder die einfachste Form von nicht-trivialen Feldern darstellen und diese sehr gut verstanden sind. In der Regel werden simulierte Datensätze in einzelne Zellen diskretisiert, welche auf linearer Interpolation basieren. Beispielsweise können auch stationäre Punkte in der Vektorfeldtopologie mittels linearen Strömungsverhaltens charakterisiert werden. Daher ist Linearität allgegenwärtig. Durch das Verständnis von lokalen linearen Strömungsverhalten in Vektorfeldern konnten verschiedene Visualisierungsmethoden erheblich verbessert werden. Ähnliche Erfolge sind auch für andere Methoden zu erwarten. In dieser Arbeit wird das Konzept der Linearität in der Visualisierung weiterentwickelt. Zunächst wird eine bestehende Definition von linearen Nachbarschaften hin zu affin-linearen Nachbarschaften erweitert. Affin-lineare Nachbarschaften sind Regionen mit einem überwiegend linearem Strömungsverhalten. Es wird eine detaillierte Diskussion über die Definition sowie die gewählten Fehlermaße durchgeführt. Weiterhin wird ein Region Growing-Verfahren vorgestellt, welches affin-lineare Nachbarschaften um beliebige Positionen bis zu einem bestimmten, benutzerdefinierten Fehlerschwellwert extrahiert. Um die lokale Linearität in Vektorfeldern zu messen, wird ein komplementärer Ansatz, welcher die Qualität der bestmöglichen linearen Näherung für eine gegebene n-Ring-Nachbarschaft berechnet, diskutiert. In einer ersten Anwendung werden affin-lineare Nachbarschaften an stationären Punkten verwendet, um deren Einflussbereich sowie ihre Wechselwirkung mit der sie umgebenden, nichtlinearen Strömung, aber auch mit sehr nah benachbarten stationären Punkten zu visualisieren. Insbesondere bei sehr großen Datensätzen kann die analytische Beschreibung der Strömung innerhalb eines linearisierten Bereichs verwendet werden, um Vektorfelder zu komprimieren und vorhandene Visualisierungsansätze zu beschleunigen. Insbesondere sollen eine Reihe von Komprimierungsalgorithmen für gitterbasierte Vektorfelder verbessert werden, welche auf der sukzessiven Entfernung einzelner Gitterkanten basieren. Im Gegensatz zu vorherigen Arbeiten sollen affin-lineare Nachbarschaften als Grundlage für eine Segmentierung verwendet werden, um eine obere Fehlergrenze bereitzustellen und somit eine hohe Qualität der Komprimierungsergebnisse zu gewährleisten. Um verschiedene Komprimierungsansätze zu bewerten, werden die Auswirkungen ihrer jeweiligen Approximationsfehler auf die Stromlinienintegration sowie auf integrationsbasierte Visualisierungsmethoden am Beispiel der numerischen Berechnung von Lyapunov-Exponenten diskutiert. 
Zum Abschluss dieser Arbeit wird eine mögliche Erweiterung des Linearitätbegriffs für Vektorfelder auf zweidimensionalen Mannigfaltigkeiten vorgestellt, welche auf einer adaptiven, atlasbasierten Vektorfeldzerlegung basiert. In many research areas, such as medicine, natural sciences or engineering, scientific visualization plays an important role and helps scientists to gain new insights. This is because visualizations can make the invisible visible. For example, visualizations can reveal the course of nerve fibers in the brain of test persons or the air flow around obstacles. This thesis in particular contributes to the subfield of flow visualization, which targets the investigation of processes in fluids and gases. A popular way to gain insights into complex datasets is to identify simple and known structures within a dataset. In case of flow visualization, this leads to the concept of local linearizations and linearity in general. This is because linear vector fields represent the most simple class of non-trivial fields and they are extremely well understood. Typically, simulated datasets are discretized into individual cells that are based on linear interpolation. Also, in vector field topology, stationary points can be characterized by considering the local linear flow behavior in their vicinity. Therefore, linearity is ubiquitous. Through the understanding of local linear flow behavior in vector fields by applying the concept of local linearity, some visualization methods have been improved significantly. Similar successes can be expected for other methods. In this thesis, the use of linearity in visualization is investigated. First, an existing definition of linear neighborhoods is extended towards the affine linear neighborhoods. Affine linear neighborhoods are regions of mostly linear flow behavior. A detailed discussion of the definition and of the chosen error measures is provided. Also a region growing algorithm that extracts affine linear neighborhoods around arbitrary positions up to a certain user-defined approximation error threshold is introduced. To measure the local linearity in vector fields, a complementary approach that computes the quality of the best possible linear approximation for a given n-ring neighborhood is discussed. As a first application, the affine linear neighborhoods around stationary points are used to visualize their region of influence, their interaction with the non-linear flow around them as well as their interaction with closely neighbored stationary points. The analytic description of the flow within a linearized region can be used to compress vector fields and accelerate existing visualization approaches, especially in case of very large datasets. In particular, the presented method aims at improving over a series of compression algorithms for grid-based vector fields that are based on edge collapse. In contrast to previous approaches, affine linear neighborhoods serve as the basis for a segmentation in order to provide an upper error bound and also to ensure a high quality of the compression results. To evaluate different compression approaches, the impact of their particular approximation errors on streamline integration as well as on integration-based visualization methods is discussed on the example of Finite-Time Lyapunov Exponent computations. To conclude the thesis, a first possible extension of linearity to fields on two-dimensional manifolds, based on an adaptive atlas-based vector field decomposition, is given.
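To illustrate the affine-linear approximation at the core of the approach (a sketch under the simplifying assumption of scattered 2D samples, not the thesis's region-growing implementation), the best fit v(x) ≈ Ax + b over a neighborhood and its maximal residual can be obtained with ordinary least squares:

    import numpy as np

    def fit_affine_linear(points: np.ndarray, vectors: np.ndarray):
        """Least-squares fit of v(x) ≈ A @ x + b over a neighborhood.
        points: (n, 2) sample positions, vectors: (n, 2) sampled flow vectors."""
        X = np.hstack([points, np.ones((len(points), 1))])      # rows [x, y, 1]
        coeffs, *_ = np.linalg.lstsq(X, vectors, rcond=None)    # shape (3, 2)
        A, b = coeffs[:2].T, coeffs[2]
        residual = np.linalg.norm(X @ coeffs - vectors, axis=1).max()
        return A, b, residual   # residual can be compared to a user-defined error threshold

    pts = np.random.default_rng(2).uniform(-1, 1, (50, 2))
    vec = pts @ np.array([[0.0, -1.0], [1.0, 0.0]]).T + 0.01    # rotation plus small offset
    A, b, err = fit_affine_linear(pts, vec)
    print(np.round(A, 3), np.round(b, 3), round(err, 6))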
- Published
- 2020
35. Weighted Logics and Weighted Simple Automata for Context-Free Languages of Infinite Words
- Author
-
Dziadek, Sven, Droste, Manfred, Kuich, Werner, Honkala, Juha, and Universität Leipzig
- Subjects
Weighted Automata, Pushdown Automata, Automata on Infinite Words, Greibach Normal Form, Weighted Logics ,ddc:000 ,Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing) ,Computer Science::Formal Languages and Automata Theory - Abstract
Büchi, Elgot and Trakhtenbrot provided a seminal connection between monadic second-order logic and finite automata for both finite and infinite words. This BET-Theorem has been extended by Lautemann, Schwentick and Thérien to context-free languages by introducing a monadic second-order logic with an additional existentially quantified second-order variable. This new variable models the stack of pushdown automata. A fundamental study by Cohen and Gold extended the context-free languages to infinite words. Our first main result is a second-order logic in the sense of Lautemann, Schwentick and Thérien with the same expressive power as ω-context-free languages. For our argument, we investigate Greibach normal forms of ω-context-free grammars as well as a new type of Büchi pushdown automata, the simple pushdown automata. Simple pushdown automata do not use ε-transitions and can change the stack only by at most one symbol. We show that simple pushdown automata of infinite words suffice to accept all ω-context-free languages. This enables us to use Büchi-type results recently developed for infinite nested words. In weighted automata theory, many classical results on formal languages have been extended into a quantitative setting. Weighted context-free languages of finite words trace back already to Chomsky and Schützenberger. Their work has been extended to infinite words by Ésik and Kuich. As in the theory of formal grammars, these weighted ω-context-free languages, or ω-algebraic series, can be represented as solutions of mixed ω-algebraic systems of equations and by weighted ω-pushdown automata. In our second main result, we show that (mixed) ω-algebraic systems can be transformed into Greibach normal form. We then investigate simple pushdown automata in the weighted setting. Here, we give our third main result. We prove that weighted simple pushdown automata of finite words recognize all weighted context-free languages, i.e., generate all algebraic series. Then, we show that weighted simple ω-pushdown automata generate all ω-algebraic series. This latter result uses the former result together with the Greibach normal form that we developed for ω-algebraic systems. As a fourth main result, we prove that for weighted simple ω-pushdown automata, Büchi-acceptance and Muller-acceptance are expressively equivalent. In our fifth main result, we derive a Nivat-like theorem for weighted simple ω-pushdown automata. This theorem states that the behaviors of our automata are precisely the projections of very simple ω-series restricted to ω-context-free languages. The last result, our sixth main result, is a weighted logic with the same expressive power as weighted simple ω-pushdown automata. To prove the equivalence, we use a similar result for weighted nested ω-word automata and apply our present result of expressive equivalence of Muller and Büchi acceptance.
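To make the restriction concrete: a simple pushdown automaton uses no ε-transitions, and each transition pushes one symbol, pops one symbol, or leaves the stack unchanged. The following sketch runs such an automaton on a finite word and multiplies weights along the run; the concrete automaton and the acceptance-by-empty-stack convention are illustrative choices, not the constructions from the thesis.

    # transitions: (state, input, stack_top or None) -> (next_state, stack_action, weight)
    # stack_action is ("push", X), ("pop",) or ("stay",) -- the stack changes by at most one symbol
    TRANS = {
        ("q", "a", None): ("q", ("push", "A"), 0.5),
        ("q", "a", "A"):  ("q", ("push", "A"), 0.5),
        ("q", "b", "A"):  ("p", ("pop",), 1.0),
        ("p", "b", "A"):  ("p", ("pop",), 1.0),
    }

    def run(word, state="q"):
        """Weight of the (unique) run on a finite word; 0.0 if the automaton blocks."""
        stack, weight = [], 1.0
        for symbol in word:
            top = stack[-1] if stack else None
            key = (state, symbol, top)
            if key not in TRANS:
                key = (state, symbol, None)   # fall back to a transition that ignores the top
            if key not in TRANS:
                return 0.0
            state, action, w = TRANS[key]
            weight *= w
            if action[0] == "push":
                stack.append(action[1])
            elif action[0] == "pop":
                stack.pop()                   # "stay" leaves the stack unchanged
        return weight if not stack else 0.0   # accept with empty stack

    print(run("aabb"))   # 0.25
    print(run("aab"))    # 0.0 (stack not empty)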
- Published
- 2020
36. Advanced Methods for Entity Linking in the Life Sciences
- Author
-
Christen, Victor and Universität Leipzig
- Subjects
entity linking, entity resolution, medical documents, temporal group linkage ,ddc:000 - Abstract
The amount of knowledge increases rapidly due to the increasing number of available data sources. However, the autonomy of data sources and the resulting heterogeneity prevent comprehensive data analysis and applications. Data integration aims to overcome heterogeneity by unifying different data sources and enriching unstructured data. The enrichment of data consists of different subtasks, among them the annotation process. The annotation process links document phrases to terms of a standardized vocabulary. Annotated documents enable effective retrieval methods, comparability of different documents, and comprehensive data analysis, such as finding adverse drug effects based on patient data. A vocabulary enables comparability through standardized terms. An ontology can also serve as a vocabulary; in addition, an ontology is defined by concepts, relationships, and logical constraints. The annotation process is applicable in different domains. Nevertheless, the annotation process differs between generic and specialized domains. This thesis emphasizes the differences between the domains and addresses the identified challenges. The majority of annotation approaches focuses on the evaluation of general domains, such as Wikipedia. This thesis evaluates the developed annotation approaches with case report forms, which are medical documents for examining clinical trials. Natural language poses various challenges, such as the same meaning being expressed by different phrases. The proposed annotation method, AnnoMap, considers the fuzziness of natural language. A further challenge is the reuse of verified annotations. Existing annotations represent knowledge that can be reused for further annotation processes. AnnoMap includes a reuse strategy that utilizes verified annotations to link new documents to appropriate concepts. Due to the broad spectrum of areas in the biomedical domain, different annotation tools exist. These tools perform differently depending on the particular domain. This thesis proposes a combination approach to unify results from different tools. The method utilizes existing tool results to build a classification model that can classify new annotations as correct or incorrect. The results show that the reuse and the machine learning-based combination improve the annotation quality compared to existing approaches focussing on the biomedical domain. A further part of data integration is entity resolution to build unified knowledge bases from different data sources. A data source consists of a set of records characterized by attributes. The goal of entity resolution is to identify records representing the same real-world entity. Many methods focus on linking data sources consisting of records characterized by attributes. Nevertheless, only a few methods can handle graph-structured knowledge bases or consider temporal aspects. Temporal aspects are essential to identify the same entities over different time intervals since these aspects underlie certain conditions. Moreover, records can be related to other records so that a small graph structure exists for each record. These small graphs can be linked to each other if they represent the same entity. This thesis proposes an entity resolution approach for census data consisting of person records for different time intervals. The approach also considers the graph structure of persons given by family relationships.
For achieving qualitative results, current methods apply machine-learning techniques to classify record pairs as the same entity. The classification task used a model that is generated by training data. In this case, the training data is a set of record pairs that are labeled as a duplicate or not. Nevertheless, the generation of training data is a time-consuming task so that active learning techniques are relevant for reducing the number of training examples. The entity resolution method for temporal graph-structured data shows an improvement compared to previous collective entity resolution approaches. The developed active learning approach achieves comparable results to supervised learning methods and outperforms other limited budget active learning methods. Besides the entity resolution approach, the thesis introduces the concept of evolution operators for communities. These operators can express the dynamics of communities and individuals. For instance, we can formulate that two communities merged or split over time. Moreover, the operators allow observing the history of individuals. Overall, the presented annotation approaches generate qualitative annotations for medical forms. The annotations enable comprehensive analysis across different data sources as well as accurate queries. The proposed entity resolution approaches improve existing ones so that they contribute to the generation of qualitative knowledge graphs and data analysis tasks.
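A small sketch (my own simplification, not the actual AnnoMap implementation) of fuzzy matching between document phrases and vocabulary terms, using a normalized string similarity with a threshold; the concept identifiers and the tiny vocabulary are illustrative:

    from difflib import SequenceMatcher

    VOCAB = {
        "C0020538": "hypertension",
        "C0011849": "diabetes mellitus",
        "C0004096": "asthma",
    }

    def annotate(phrase: str, threshold: float = 0.8):
        """Link a document phrase to the best-matching vocabulary concept, if any."""
        best = max(
            ((cid, term, SequenceMatcher(None, phrase.lower(), term).ratio())
             for cid, term in VOCAB.items()),
            key=lambda x: x[2],
        )
        return best if best[2] >= threshold else None

    print(annotate("Hypertensive"))        # fuzzy hit on 'hypertension'
    print(annotate("diabetes melitus"))    # tolerates the misspelling
    print(annotate("headache"))            # None: below threshold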
- Published
- 2020
37. Advanced Protocols for Peer-to-Peer Data Transmission in Wireless Gigabit Networks
- Author
-
Friedrich, Jan, Lindemann, Christoph, and Universität Leipzig
- Subjects
ddc:000 ,opportunistic networks, wireless networks, frame aggregation - Abstract
This thesis tackles problems on the IEEE 802.11 MAC layer, the network layer, and the application layer to further push the performance of wireless P2P applications in a holistic way. It contributes to a better understanding and utilization of two major IEEE 802.11 MAC features, frame aggregation and block acknowledgement, as well as to the design and implementation of opportunistic networks on off-the-shelf hardware, and it proposes a document exchange protocol including document recommendation. First, this thesis contributes a measurement study of the A-MPDU frame aggregation behavior of IEEE 802.11n in a real-world, multi-hop, indoor mesh testbed. Furthermore, this thesis presents MPDU payload adaptation (MPA) to utilize A-MPDU subframes to increase the overall throughput under bad channel conditions. MPA adapts the size of MAC protocol data units to channel conditions, to increase the throughput and lower the delay in error-prone channels. The results suggest that under erroneous conditions throughput can be maximized by limiting the MPDU size. As second major contribution, this thesis introduces Neighborhood-aware OPPortunistic networking on Smartphones (NOPPoS). NOPPoS creates an opportunistic, pocket-switched network using current generation, off-the-shelf mobile devices. As its main novel feature, NOPPoS is highly responsive to node mobility due to periodic, low-energy scans of its environment, using Bluetooth Low Energy advertisements. The last major contribution is the Neighborhood Document Sharing (NDS) protocol. NDS enables users to discover and retrieve arbitrary documents shared by other users in their proximity, i.e. in the communication range of their IEEE 802.11 interface. However, IEEE 802.11 connections are only used on-demand during file transfers and indexing of files in the proximity of the user. Simulations show that NDS interconnects over 90% of all devices in communication range. Finally, NDS is extended by the content recommendation system User Preference-based Probability Spreading (UPPS), a graph-based approach. It integrates user-item scoring into a graph-based tag-aware item recommender system. UPPS utilizes novel formulas for affinity and similarity scoring, taking into account user-item preference in the mass diffusion of the recommender system. The presented results show that UPPS is a significant improvement over previous approaches.
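The intuition behind MPA — that under high bit error rates smaller MPDUs waste less airtime on corrupted frames — can be illustrated with a back-of-the-envelope goodput model (a hypothetical model with made-up overhead and candidate sizes, not the formulas or parameters used in the thesis):

    def expected_goodput(mpdu_bytes: int, ber: float, overhead_bytes: int = 40) -> float:
        """Expected useful bytes delivered per byte of airtime for one MPDU size."""
        p_ok = (1.0 - ber) ** (8 * (mpdu_bytes + overhead_bytes))  # whole frame must survive
        return mpdu_bytes * p_ok / (mpdu_bytes + overhead_bytes)

    def best_mpdu_size(ber: float, candidates=(256, 512, 1024, 2048, 4096)) -> int:
        return max(candidates, key=lambda size: expected_goodput(size, ber))

    for ber in (1e-7, 1e-5, 1e-4):
        print(f"BER {ber:.0e}: best MPDU payload {best_mpdu_size(ber)} bytes")
    # clean channels favour large MPDUs, noisy channels favour smaller ones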
- Published
- 2020
38. Improving Academic Natural Language Processing Infrastructures Utilizing Cluster Computation
- Author
-
Sahami, Soheila and Universität Leipzig
- Subjects
Natural Language Processing, Cluster Computation, Big Data, Just-In-Time Delivery Systems ,ddc:000 - Abstract
In light of widespread digitization endeavors and ever-growing textual data generation, developing efficient academic Natural Language Processing (NLP) infrastructures, which can deal with large amounts of data, is of particular importance. Novel computation technologies allow tools that support big data and heavy computation while performing timely and cost-effective data processing. This development has led researchers to demand that knowledge be extracted from ever-increasing textual data before it is outdated. Cluster computation is a modern technology for handling big data efficiently. It provides distribution of computing and data over a number of machines in a cluster, as well as efficient use of resources, which are key requirements to process big data in a timely manner. It also assures applications’ high availability and fault tolerance, which are fundamental concerns when dealing with vast amounts of data. In addition, it provides load balancing of data during the execution of tasks, which results in optimal use of resources and enhances efficiency. Data-oriented parallelization is an effective solution to enable the currently available academic NLP infrastructures to process big data. This approach offers a solution to parallelize the NLP tools which comprise identical non-complicated tasks without the expense of changing NLP algorithms. This thesis presents the adaption of cluster computation technology to academic NLP infrastructures to address the notable features that are essential to process vast quantities of text materials efficiently, in terms of both resources and time. Apache Spark on top of Apache Hadoop and its ecosystem have been utilized to develop a set of NLP tools that provide a distributed environment to execute the NLP tasks. Many experiments were conducted to assess the functionality of the designated strategy. This thesis shows that using cluster computation technology and data-oriented parallelization enables academic NLP infrastructures to execute large amounts of textual data in a timely manner while improving the performance of the NLP tools. Moreover, these experiments provide information that brings a more realistic and transparent estimation of workflows’ costs (required hardware resources) and execution time, along with the fastest, optimum, or feasible resource configuration for each individual workflow. This knowledge can be employed by users to trade-off between run-time, size of data, and hardware, and it enables them to design a strategy for data storage, duration of data retention, and delivery time. This has the potential to enhance researchers’ satisfaction when using academic NLP infrastructures. The thesis also shows that a cluster computation approach provides the capacity to adapt NLP services with JIT delivery systems. The proposed strategy assures the reliability and predictability of the services, which are the main characteristics of the services in JIT delivery systems. Defining the relevant parameters, recording the behavior of the services, and analyzing the generated data resulted in the provision of knowledge that can be utilized to create a service catalog—a fundamental requirement for the services in JIT delivery systems—for each service offered. This knowledge also helps to generate the performance profiles for each item mentioned in the service catalog and to update them continuously to cover new experiments and improve service quality.
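A minimal sketch of the data-oriented parallelization idea described above (assuming a PySpark installation; the corpus path and the naive tokenizer are placeholders, not components of the actual infrastructure): the same lightweight task is applied independently to text partitions distributed over the cluster.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("corpus-wordcount").getOrCreate()

    lines = spark.read.text("hdfs:///corpora/sample/*.txt")    # placeholder path

    # the identical, simple task runs on every partition in parallel
    counts = (
        lines.rdd
        .flatMap(lambda row: row.value.lower().split())         # naive tokenizer
        .map(lambda token: (token, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    for token, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(token, n)

    spark.stop()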
- Published
- 2020
39. Question Answering on RDF Data Cubes
- Author
-
Höffner, Konrad and Universität Leipzig
- Subjects
ddc:000 ,question answering, linked open data, semantic question answering, rdf data cubes - Abstract
The Semantic Web, a Web of Data, is an extension of the World Wide Web (WWW), a Web of Documents. A large amount of such data is freely available as Linked Open Data (LOD) for many areas of knowledge, forming the LOD Cloud. While this data conforms to the Resource Description Framework (RDF) and can thus be processed by machines, users need to master a formal query language and learn a specific vocabulary. Semantic Question Answering (SQA) systems remove those access barriers by letting the user ask natural language questions that the systems translate into formal queries. Thus, the research area of SQA plays an important role for the acceptance and benefit of the Semantic Web. The original contributions of this thesis to SQA are: First, we survey the current state of the art of SQA. We complement existing surveys by systematically identifying SQA publications in the chosen timeframe. 72 publications describing 62 different systems are systematically and manually selected using predefined inclusion and exclusion criteria out of 1960 candidates from the end of 2010 to July 2015. The survey identifies common challenges, structured solutions, and recommendations on research opportunities for future systems. From that point on, we focus on multidimensional numerical data, which is immensely valuable as it influences decisions in health care, policy and finance, among others. With the growth of the open data movement, more and more of it is becoming freely available. A large amount of such data is included in the LOD cloud using the RDF Data Cube (RDC) vocabulary. However, consuming multidimensional numerical data requires experts and specialized tools. Traditional SQA systems cannot process RDCs because their meta-structure is opaque to applications that expect facts to be encoded in single triples. This motivates our second contribution, the design and implementation of the first SQA algorithm on RDF Data Cubes. We kick-start this new research subfield by creating a user question corpus and a benchmark over multiple data sets. The evaluation of our system on the benchmark, which is included in the public Question Answering over Linked Data (QALD) challenge of 2016, shows the feasibility of the approach, but also highlights challenges, which we discuss in detail as a starting point for future work in the field. The benchmark is based on our final contribution, the addition of 955 financial government spending data sets to the LOD cloud by transforming data sets of the OpenSpending project to RDF Data Cubes. Open spending data has the power to reduce corruption by increasing accountability, and it strengthens democracy because voters can make better-informed decisions. An informed and trusting public also strengthens the government itself because it is more likely to commit to large projects. OpenSpending.org is an open platform that provides public finance data from governments around the world.
The transformation result, called LinkedSpending, consists of more than five million planned and carried out financial transactions in 955 data sets from all over the world as Linked Open Data and is freely available and openly licensed. Table of contents: 1 Introduction 1.1 Motivation 1.2 Research Questions and Contributions 1.3 Thesis Structure 2 Preliminaries 2.1 Semantic Web 2.1.1 URIs and URLs 2.1.2 Linked Data 2.1.3 Resource Description Framework 2.1.4 Ontologies 2.2 Question Answering 2.2.1 History 2.2.2 Definitions 2.2.3 Evaluation 2.2.4 SPARQL 2.2.5 Controlled Vocabulary 2.2.6 Faceted Search 2.2.7 Keyword Search 2.3 Data Cubes 3 Related Work 3.1 Semantic Question Answering 3.1.1 Surveys 3.1.2 Evaluation Campaigns 3.1.3 System Frameworks 3.2 Question Answering on RDF Data Cubes 3.3 RDF Data Cube Data Sets 4 Systematic Survey of Semantic Question Answering 4.1 Methodology 4.1.1 Inclusion Criteria 4.1.2 Exclusion Criteria 4.1.3 Result 4.2 Systems 4.2.1 Implementation 4.2.2 Examples 4.2.3 Answer Presentation 4.3 Challenges 4.3.1 Lexical Gap 4.3.2 Ambiguity 4.3.3 Multilingualism 4.3.4 Complex Queries 4.3.5 Distributed Knowledge 4.3.6 Procedural, Temporal and Spatial Questions 4.3.7 Templates 5 Question Answering on RDF Data Cubes 5.1 Question Corpus 5.2 Corpus Analysis 5.3 Data Cube Operations 5.4 Algorithm 5.4.1 Preprocessing 5.4.2 Matching 5.4.3 Combining Matches to Constraints 5.4.4 Execution 6 LinkedSpending 6.1 Choice of Source Data 6.1.1 Government Spending 6.1.2 OpenSpending 6.2 OpenSpending Source Data 6.3 Conversion of OpenSpending to RDF 6.4 Publishing 6.5 Overview over the Data Sets 6.6 Data Set Quality Analysis 6.6.1 Intrinsic Dimensions 6.6.2 Representational Dimensions 6.7 Evaluation 6.7.1 Experimental Setup and Benchmark 6.7.2 Discussion 7 Conclusion 7.1 Research Question Summary 7.2 SQA Survey 7.2.1 Lexical Gap 7.2.2 Ambiguity 7.2.3 Multilingualism 7.2.4 Complex Operators 7.2.5 Distributed Knowledge 7.2.6 Procedural, Temporal and Spatial Data 7.2.7 Templates 7.2.8 Future Research 7.3 CubeQA 7.4 LinkedSpending 7.4.1 Shortcomings 7.4.2 Future Work Bibliography Appendix A The CubeQA Question Corpus Appendix B The QALD-6 Task 3 Benchmark Questions B.1 Training Data B.2 Testing Data
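A hedged illustration of what querying an RDF Data Cube looks like in practice (not taken from CubeQA or LinkedSpending): the SPARQL endpoint, data set URI and the dimension/measure properties below are hypothetical placeholders; only the qb: vocabulary terms are standard.
```python
# Illustrative only: aggregating a measure of RDF Data Cube observations via SPARQL.
# Endpoint, data set URI and properties are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>

SELECT (SUM(?amount) AS ?total) WHERE {
  ?obs a qb:Observation ;
       qb:dataSet <http://example.org/dataset/spending-2015> ;
       <http://example.org/dimension/year> "2015" ;
       <http://example.org/measure/amount> ?amount .
}
"""

sparql = SPARQLWrapper("http://example.org/sparql")   # hypothetical SPARQL endpoint
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print("total spending:", row["total"]["value"])
```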
- Published
- 2020
40. Expressiveness and Decidability of Weighted Automata and Weighted Logics
- Author
-
Paul, Erik, Droste, Manfred, Seidl, Helmut, and Universität Leipzig
- Subjects
TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES ,Computer Science::Logic in Computer Science ,ddc:000 ,Weighted Automata, Weighted Logics, Quantitative Model Theory, Feferman-Vaught Theorem, Max-Plus Tree Automata, Ambiguity, Quantitative Monitor Automata ,Computer Science::Formal Languages and Automata Theory - Abstract
Automata theory, one of the main branches of theoretical computer science, established its roots in the middle of the 20th century. One of its most fundamental concepts is that of a finite automaton, a basic yet powerful model of computation. In essence, finite automata provide a method to finitely represent possibly infinite sets of strings. Such a set of strings is also called a language, and the languages which can be described by finite automata are known as regular languages. Owing to their versatility, regular languages have received a great deal of attention over the years. Other formalisms were shown to be expressively equivalent to finite automata, most notably regular grammars, regular expressions, and monadic second order (MSO) logic. To increase expressiveness, the fundamental idea underlying finite automata and regular languages was also extended to describe not only languages of strings, or words, but also of infinite words by Büchi and Muller, finite trees by Doner and Thatcher and Wright, infinite trees by Rabin, nested words by Alur and Madhusudan, and pictures by Blum and Hewitt, just to name a few examples. In a parallel line of development, Schützenberger introduced weighted automata which allow the description of quantitative properties of regular languages. In subsequent works, many of these descriptive formalisms and extensions were combined and their relationships investigated. For example, weighted regular expressions and weighted logics have been developed as well as regular expressions for trees and pictures, regular grammars for trees, pictures, and nested words, and logical characterizations for regular languages of trees, pictures, and nested words. In this work, we focus on two of these extensions and their relationship, namely weighted automata and weighted logics. Just as the classical Büchi-Elgot-Trakhtenbrot Theorem established the coincidence of regular languages with languages definable in monadic second order logic, weighted automata have been shown to be expressively equivalent to a specific fragment of a weighted monadic second order logic by Droste and Gastin. We explore several aspects of weighted automata and of this weighted logic. More precisely, the thesis considers the following topics. In the first part, we extend the classical Feferman-Vaught Theorem to the weighted setting. The Feferman-Vaught Theorem is one of the fundamental theorems in model theory. The theorem describes how the computation of the truth value of a first order sentence in a generalized product of relational structures can be reduced to the computation of truth values of first order sentences in the contributing structures and the evaluation of an MSO sentence in the index structure. The theorem itself has a long-standing history. It builds upon work of Mostowski, and was shown in subsequent works to hold true for MSO logic. Here, we show that under appropriate assumptions, the Feferman-Vaught Theorem also holds true for a weighted MSO logic with arbitrary commutative semirings as weight structure. In the second part, we lift four decidability results from max-plus word automata to max-plus tree automata. Max-plus word and tree automata are weighted automata over the max-plus semiring and assign real numbers to words or trees, respectively. 
We show that, like for max-plus word automata, the equivalence, unambiguity, and sequentiality problems are decidable for finitely ambiguous max-plus tree automata, and that the finite sequentiality problem is decidable for unambiguous max-plus tree automata. In the last part, we develop a logic which is expressively equivalent to quantitative monitor automata. Introduced very recently by Chatterjee, Henzinger, and Otop, quantitative monitor automata are an automaton model operating on infinite words. Quantitative monitor automata possess several interesting features. They are expressively equivalent to a subclass of nested weighted automata, an automaton model which for many valuation functions has decidable emptiness and universality problems. Also, quantitative monitor automata are more expressive than weighted Büchi-automata and their extension with valuation functions. We introduce a new logic which we call monitor logic and show that it is expressively equivalent to quantitative monitor automata.
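The following toy sketch (not from the thesis) illustrates how a max-plus word automaton assigns a value to a word: the maximum, over all runs, of the sum of the transition weights, computed here via matrix products over the max-plus semiring. The automaton and all its weights are made up.
```python
# Illustrative sketch: a max-plus word automaton assigns to a word the maximum, over all
# runs, of the sum of transition weights.  Weights live in (R ∪ {-inf}, max, +).
import numpy as np

NEG_INF = -np.inf

def maxplus_matmul(A, B):
    """Matrix 'product' over the max-plus semiring: C[i, j] = max_k (A[i, k] + B[k, j])."""
    C = np.full((A.shape[0], B.shape[1]), NEG_INF)
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            C[i, j] = np.max(A[i, :] + B[:, j])
    return C

# Toy automaton with two states over the alphabet {a, b}; NEG_INF marks missing transitions.
M = {
    "a": np.array([[1.0, NEG_INF], [NEG_INF, 2.0]]),
    "b": np.array([[NEG_INF, 0.0], [3.0, NEG_INF]]),
}
init = np.array([[0.0, NEG_INF]])    # initial weights (row vector)
final = np.array([[0.0], [0.0]])     # final weights (column vector)

def weight(word):
    """Weight of a word = init ⊗ M(a1) ⊗ ... ⊗ M(an) ⊗ final in the max-plus semiring."""
    acc = init
    for letter in word:
        acc = maxplus_matmul(acc, M[letter])
    return maxplus_matmul(acc, final)[0, 0]

print(weight("aab"))   # best run: 1 + 1 + 0 = 2
print(weight("abb"))   # best run: 1 + 0 + 3 = 4
```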
- Published
- 2020
41. Structure, Dynamics and Self-Organization in Recurrent Neural Networks: From Machine Learning to Theoretical Neuroscience
- Author
-
Vilimelis Aceituno, Pau and Universität Leipzig
- Subjects
Recurrent Neural Networks, Reservoir Computing, Random Matrix Theory, Synaptic Plasticity ,Quantitative Biology::Neurons and Cognition ,ddc:000 - Abstract
At a first glance, artificial neural networks, with engineered learning algorithms and carefully chosen nonlinearities, are nothing like the complicated self-organized spiking neural networks studied by theoretical neuroscientists. Yet, both adapt to their inputs, keep information from the past in their state space and are capable of learning, implying that some information processing principles should be common to both. In this thesis we study those principles by incorporating notions of systems theory, statistical physics and graph theory into artificial neural networks and theoretical neuroscience models. The starting point for this thesis is reservoir computing (RC), a learning paradigm used both in machine learning (Jaeger 2004) and in theoretical neuroscience (Maass et al. 2002). A neural network in RC consists of two parts, a reservoir – a directed and weighted network of neurons that projects the input time series onto a high dimensional space – and a readout which is trained to read the state of the neurons in the reservoir and combine them linearly to give the desired output. In classical RC, the reservoir is randomly initialized and left untrained, which alleviates the training costs in comparison to other recurrent neural networks. However, this lack of training implies that reservoirs are not adapted to specific tasks and thus their performance is often lower than that of other neural networks. Our contribution has been to show how knowledge about a task can be integrated into the reservoir architecture, so that reservoirs can be tailored to specific problems without training. We do this design by identifying two features that are useful for machine learning: the memory of the reservoir and its power spectra. First we show that the correlations between neurons limit the capacity of the reservoir to retain traces of previous inputs, and demonstrate that those correlations are controlled by the moduli of the eigenvalues of the adjacency matrix of the reservoir. Second, we prove that when the reservoir resonates at the frequencies that are present in the desired output signal, the performance of the readout increases. Knowing the features of the reservoir dynamics that we need, the next question is how to impose them. The simplest way to design a network that resonates at a certain frequency is by adding cycles, which act as feedback loops, but this also induces correlations and hence memory modifications. To disentangle the frequency and memory design, we studied how the addition of cycles modifies the eigenvalues of the adjacency matrix of the network. Surprisingly, the shape of the eigenvalues is quite beautiful (Aceituno et al. 2019) and can be characterized using random matrix theory tools. Combining this knowledge with our result relating eigenvalues and correlations, we designed a heuristic that tailors reservoirs to specific tasks and showed that it improves upon state-of-the-art RC in three different machine learning tasks. Although this idea works in the machine learning version of RC, there is one fundamental problem when we try to translate it to the world of theoretical neuroscience: the proposed frequency adaptation requires prior knowledge of the task, which might not be plausible in a biological neural network.
Therefore the following questions are whether those resonances can emerge through unsupervised learning, and which kind of learning rules would be required. Remarkably, these resonances can be induced by the well-known spike-timing-dependent plasticity (STDP) combined with homeostatic mechanisms. We show this by deriving two self-consistent equations: one where the activity of every neuron can be calculated from its synaptic weights and its external inputs, and a second one where the synaptic weights can be obtained from the neural activity. By considering spatio-temporal symmetries in our inputs we obtained two families of solutions to those equations where a periodic input is enhanced by the neural network after STDP. This approach shows that periodic and quasiperiodic inputs can induce resonances that agree with the aforementioned RC theory. Those results, although rigorous, are expressed in the language of statistical physics and cannot be easily tested or verified on real, scarce data. To make them more accessible to the neuroscience community we showed that latency reduction, a well-known effect of STDP (Song et al. 2000) which has been experimentally observed (Mehta et al. 2000), generates neural codes that agree with the self-consistency equations and their solutions. In particular, this analysis shows that metabolic efficiency, synchronization and predictions can emerge from the same phenomenon of latency reduction, thus closing the loop with our original machine learning problem. To summarize, this thesis exposes principles of learning in recurrent neural networks that are consistent with adaptation in the nervous system and also improve current machine learning methods. This is done by leveraging features of the dynamics of recurrent neural networks, such as resonances and correlations, in machine learning problems, and then imposing the required dynamics on reservoir computing through control theory notions such as feedback loops and spectral analysis. We then assessed the plausibility of such adaptation in biological networks, deriving solutions from self-organizing processes that are biologically plausible and align with the machine learning prescriptions. Finally, we relate those processes to learning rules in biological neurons, showing how small local adaptations of the spike times can lead to neural codes that are efficient and can be interpreted in machine learning terms.
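A minimal echo state network sketch in the spirit of classical reservoir computing, for illustration only: the reservoir is a fixed random recurrent network rescaled to a spectral radius below one, and only a linear ridge-regression readout is trained. The task and all parameters are made up and do not reproduce the thesis experiments.
```python
# Minimal reservoir computing (echo state network) sketch.  A fixed random reservoir projects
# the input into a high-dimensional state space; only the linear readout is trained.
import numpy as np

rng = np.random.default_rng(0)

N = 200                                      # reservoir size
T = 2000                                     # length of the synthetic time series
u = np.sin(0.2 * np.arange(T + 1))           # toy input signal
y_target = u[:-1] ** 2                       # toy target: square of the input signal

# Random input and recurrent weights; the recurrent matrix is rescaled so that its spectral
# radius is below 1 (a common heuristic for the echo state property).
W_in = rng.uniform(-0.5, 0.5, size=(N, 1))
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

# Drive the reservoir and record its states.
x = np.zeros(N)
states = np.zeros((T, N))
for t in range(T):
    x = np.tanh(W @ x + W_in[:, 0] * u[t])
    states[t] = x

# Ridge-regression readout (discard an initial washout period).
washout, lam = 100, 1e-6
S, Y = states[washout:], y_target[washout:]
W_out = np.linalg.solve(S.T @ S + lam * np.eye(N), S.T @ Y)

pred = S @ W_out
print("readout NRMSE:", np.sqrt(np.mean((pred - Y) ** 2)) / np.std(Y))
```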
- Published
- 2020
42. Maschinelles Lernen zur paarweisen Analyse epithelialer Impedanzspektren
- Author
-
Schindler, Benjamin and Universität Leipzig
- Subjects
ddc:000 ,Maschinelles Lernen, Impedanzspektroskopie, Epithelgewebe, epitheliale Impedanzspektroskopie - Abstract
In modern medicine, impedance spectroscopy is one of the most important methods for studying epithelia, one of the four basic tissue types of multicellular organisms. The transport of ions is among the most important functions of epithelia and can be modified by adding agents such as nystatin. In epithelial impedance spectroscopy, the complex-valued impedance is measured from the ion transport using a sinusoidal alternating current. From an impedance spectrum, physical properties of the epithelium can thus be determined using an equivalent circuit model. This thesis presents a machine-learning-based method for the analysis of epithelial impedance spectra, which determines the resistances and capacitances of the apical and basolateral cell membranes and the paracellular resistance of the intercellular spaces for the cell line HT29/B6. In the form of a pairwise analysis, two impedance spectra of an epithelium are considered, recorded before and after the addition of the agent nystatin. Using an error model and given control conditions, a data set is synthesized that consists of impedance spectra and the corresponding parameters of the equivalent circuit. Machine learning methods of different learning paradigms are applied to the generated data set in order to determine the physical properties of the modeled epithelium. Using different feature sets and representations of the impedance, decision trees, random forests, multilayer perceptrons and support vector machines are trained. In a post-processing step, the obtained predictions are optimized with a non-linear least-squares approach. The target quantities can thus be determined with an average percentage error between ±2% and ±11%.
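For illustration only, a sketch of the general approach described above, using a deliberately simplified single-RC equivalent circuit rather than the epithelial model of the thesis: a random forest regressor is trained on synthetic spectra to estimate circuit parameters, and its prediction is then refined by non-linear least squares as a post-processing step. All parameter ranges, frequencies and noise levels are invented.
```python
# Illustration only: regressing equivalent-circuit parameters from synthetic impedance
# spectra with a random forest, then refining the estimate by non-linear least squares.
# The R + (R||C) circuit is a deliberate simplification, not the epithelial cell model.
import numpy as np
from scipy.optimize import least_squares
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
freqs = np.logspace(0, 5, 40)                 # measurement frequencies in Hz (made up)
omega = 2 * np.pi * freqs

def impedance(params, omega):
    """Z(w) = R_s + R_m / (1 + j*w*R_m*C_m) for a series resistor plus an R||C element."""
    r_s, r_m, c_m = params
    return r_s + r_m / (1 + 1j * omega * r_m * c_m)

def features(z):
    """Flatten a complex spectrum into a real feature vector (real and imaginary parts)."""
    return np.concatenate([z.real, z.imag])

# Synthesize a training set with mild multiplicative noise.
n = 2000
params = np.column_stack([
    rng.uniform(5, 50, n),        # series resistance (Ohm), invented range
    rng.uniform(50, 500, n),      # membrane resistance (Ohm), invented range
    rng.uniform(1e-7, 1e-5, n),   # membrane capacitance (F), invented range
])
X = np.array([features(impedance(p, omega) * (1 + 0.01 * rng.normal(size=omega.size)))
              for p in params])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, params)

# Predict parameters for a new spectrum, then refine with least squares (post-processing).
true_p = np.array([20.0, 200.0, 2e-6])
z_meas = impedance(true_p, omega) * (1 + 0.01 * rng.normal(size=omega.size))
p0 = model.predict(features(z_meas).reshape(1, -1))[0]

def residuals(p):
    diff = impedance(p, omega) - z_meas
    return np.concatenate([diff.real, diff.imag])

refined = least_squares(residuals, p0, bounds=([0, 0, 0], np.inf)).x
print("initial RF estimate:", p0)
print("refined estimate:   ", refined)
```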
- Published
- 2020
- Full Text
- View/download PDF
43. Scalable Data Integration for Linked Data
- Author
-
Nentwig, Markus and Universität Leipzig
- Subjects
Data Integration, Linked Data, Apache Flink, Incremental Clustering ,ddc:000 - Abstract
Linked Data describes an extensive set of structured but heterogeneous data sources where entities are connected by formal semantic descriptions. In the vision of the Semantic Web, these semantic links are extended towards the World Wide Web to provide as much machine-readable data as possible for search queries. The resulting connections allow an automatic evaluation to find new insights into the data. Identifying these semantic connections between two data sources with automatic approaches is called link discovery. We derive common requirements and a generic link discovery workflow based on similarities between entity properties and associated properties of ontology concepts. Most of the existing link discovery approaches disregard the fact that in times of Big Data, an increasing volume of data sources poses new demands on link discovery. In particular, the problem of complex and time-consuming link determination escalates with an increasing number of intersecting data sources. To overcome the restriction of pairwise linking of entities, holistic clustering approaches are needed to link equivalent entities of multiple data sources to construct integrated knowledge bases. In this context, the focus on efficiency and scalability is essential. For example, reusing existing links or background information can help to avoid redundant calculations. However, when dealing with multiple data sources, additional data quality problems must also be dealt with. This dissertation addresses these comprehensive challenges by designing holistic linking and clustering approaches that enable the reuse of existing links. Unlike previous systems, we execute the complete data integration workflow via a distributed processing system. At first, the LinkLion portal will be introduced to provide existing links for new applications. These links act as a basis for a physical data integration process to create a unified representation for equivalent entities from many data sources. We then propose a holistic clustering approach to form consolidated clusters for the same real-world entities from many different sources. At the same time, we exploit the semantic type of entities to improve the quality of the result. The process identifies errors in existing links and can find numerous additional links. Additionally, the entity clustering has to react to the high dynamics of the data. In particular, this requires scalable approaches for continuously growing data sources with many entities as well as additional new sources. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from a few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that support the continuous addition of new entities and data sources. To cope with the ever-increasing number of Linked Data sources, efficient and scalable methods based on distributed processing systems are required. Thus we propose distributed holistic approaches to link many data sources based on a clustering of entities that represent the same real-world object. The implementation is realized on Apache Flink. In contrast to previous approaches, we utilize efficiency-enhancing optimizations for both distributed static and dynamic clustering. An extensive comparative evaluation of the proposed approaches with various distributed clustering strategies shows high effectiveness for datasets from multiple domains as well as scalability on a multi-machine Apache Flink cluster.
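A single-machine sketch of the underlying clustering idea only (the thesis implements such clustering as distributed programs on Apache Flink): entities from several sources that are connected by equivalence links are grouped into clusters representing the same real-world object via connected components, here with a simple union-find. The entity identifiers and links are hypothetical, and link quality and scalability are ignored.
```python
# Toy entity clustering from equivalence links via connected components (union-find).
# The prefixes below are just labels for three hypothetical sources; the identifiers are made up.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

links = [
    ("dbpedia:Leipzig", "geonames:2879139"),
    ("geonames:2879139", "wikidata:Q2079"),
    ("dbpedia:Berlin", "geonames:2950159"),
]
for a, b in links:
    union(a, b)

clusters = {}
for entity in list(parent):
    clusters.setdefault(find(entity), set()).add(entity)
print(list(clusters.values()))
```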
- Published
- 2019
44. Neural Networks for Collaborative Filtering
- Author
-
Feigl, Josef and Universität Leipzig
- Subjects
Recommender systems, Neural networks, Matrix factorization ,ddc:000 - Abstract
Recommender systems are an integral part of almost all modern e-commerce companies. They contribute significantly to the overall customer satisfaction by helping the user discover new and relevant items, which consequently leads to higher sales and stronger customer retention. It is, therefore, not surprising that large e-commerce shops like Amazon or streaming platforms like Netflix and Spotify even use multiple recommender systems to further increase user engagement. Finding the most relevant items for each user is a difficult task that is critically dependent on the available user feedback information. However, most users typically interact with products only through noisy implicit feedback, such as clicks or purchases, rather than providing explicit information about their preferences, such as product ratings. This usually makes large amounts of behavioural user data necessary to infer accurate user preferences. One popular approach to make the most use of both forms of feedback is called collaborative filtering. Here, the main idea is to compare individual user behaviour with the behaviour of all known users. Although there are many different collaborative filtering techniques, matrix factorization models are among the most successful ones. In contrast, while neural networks are nowadays the state-of-the-art method for tasks such as image recognition or natural language processing, they are still not very popular for collaborative filtering tasks. Therefore, the main focus of this thesis is the derivation of multiple wide neural network architectures to mimic and extend matrix factorization models for various collaborative filtering problems and to gain insights into the connection between these models. The basis of the proposed architectures is formed by wide and shallow feedforward neural networks, which are first established for rating prediction tasks on explicit feedback datasets. These networks consist of large input and output layers, which allow them to capture user and item representations similar to matrix factorization models. By deriving all weight updates and comparing the structure of both models, it is proven that a simplified version of the proposed network can mimic common matrix factorization models: a result that has not been shown, as far as we know, in this form before. Additionally, various extensions are thoroughly evaluated. The new findings of this evaluation can also easily be transferred to other matrix factorization models. This neural network architecture can be extended for personalized ranking tasks on implicit feedback datasets. For these problems, it is necessary to rank products according to individual preferences using only the provided implicit feedback. One of the most successful and influential approaches for personalized ranking tasks is Bayesian Personalized Ranking, which attempts to learn pairwise item rankings and can also be used in combination with matrix factorization models. It is shown how the introduction of an additional ranking layer forces the network to learn pairwise item rankings. In addition, similarities between this novel neural network architecture and a matrix factorization model trained with Bayesian Personalized Ranking are proven. To the best of our knowledge, this is the first time that these connections have been shown. The state-of-the-art performance of this network is demonstrated in a detailed evaluation.
The most comprehensive feedback datasets consist of a mixture of explicit as well as implicit feedback information. Here, the goal is to predict if a user will like an item, similar to rating prediction tasks, even if this user has never given any explicit feedback at all: a problem that has not been covered by the collaborative filtering literature yet. The network to solve this task is composed of two networks: one for the explicit and one for the implicit feedback. Additional item features, which capture all information necessary to rank items, are learned using the implicit feedback. Afterwards, these features are used to improve the explicit feedback prediction. Both parts of this combined network have different optimization goals, are trained simultaneously and, therefore, influence each other. A detailed evaluation shows that this approach helps to improve the network's overall predictive performance, especially for ranking metrics.
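An illustrative toy example (not the thesis code) of the connection exploited above: matrix factorization for rating prediction trained with plain stochastic gradient descent. The model score p_u · q_i is exactly what a wide, shallow network with one-hot user and item inputs and a dot-product output computes; the ratings below are random synthetic data, so only the training error is meaningful.
```python
# Toy matrix factorization for rating prediction via SGD on synthetic (user, item, rating) data.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 8

# Hypothetical explicit feedback: random (user, item, rating) triples.
ratings = [(rng.integers(n_users), rng.integers(n_items), rng.integers(1, 6))
           for _ in range(2000)]

P = 0.1 * rng.normal(size=(n_users, k))     # user factors ("input-side weights")
Q = 0.1 * rng.normal(size=(n_items, k))     # item factors ("output-side weights")
lr, reg = 0.01, 0.05

for epoch in range(20):
    for idx in rng.permutation(len(ratings)):
        u, i, r = ratings[idx]
        err = r - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])
    rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]))
    print(f"epoch {epoch:2d}  train RMSE {rmse:.3f}")
```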
- Published
- 2019
45. Comparative Genomics in Distant Taxa: Generating Total Orders of Digraphs
- Author
-
Gärtner, Fabian and Universität Leipzig
- Subjects
ddc:000 ,Graph, Genomic, Supergenome, Superbubble - Published
- 2019
46. Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution
- Author
-
Retzlaff, Nancy and Universität Leipzig
- Subjects
ddc:000 ,Evolution, Alignments, Comparative Linguistics - Abstract
Evolutionary processes can be found in almost any historical, i.e. evolving, system that erroneously copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Having a well-chosen set of characters describing a system of different entities, such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close to each other, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over forming hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. Expanding gene clusters by unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps in extracting phylogenetic information from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the amount of available large digital corpora increased, so did the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic needs. Here, we use jAlign to calculate bigram alignments, i.e. alignments computed by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks.
One major obstacle when using pairwise alignments is the systematic errors they make, such as the underestimation of gaps and their misplacement. Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. Multiple sequence alignments can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
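A toy demonstration (not the thesis pipeline) of the Hamiltonian path observation: for a distance matrix that follows the unequal crossing-over pattern, the shortest Hamiltonian path through the pairwise genetic distances recovers the positional order of the paralogs in the cluster. The gene names and the distance matrix are invented, and the brute-force search is only practical for tiny examples.
```python
# Brute-force shortest Hamiltonian path over a made-up matrix of pairwise genetic distances.
from itertools import permutations

import numpy as np

genes = ["g1", "g2", "g3", "g4"]
D = np.array([                     # hypothetical pairwise distances
    [0.0, 1.0, 2.1, 3.2],
    [1.0, 0.0, 1.2, 2.2],
    [2.1, 1.2, 0.0, 1.1],
    [3.2, 2.2, 1.1, 0.0],
])

def path_length(order):
    return sum(D[a, b] for a, b in zip(order, order[1:]))

best = min(permutations(range(len(genes))), key=path_length)
print([genes[i] for i in best])    # expected: the cluster order g1..g4 (or its reverse)
```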
- Published
- 2019
47. From Best Match Graphs to Gene Trees: A new perspective on graph-based orthology inference
- Author
-
Geiß, Manuela and Universität Leipzig
- Subjects
ddc:000 ,phylogenetics, orthology detection, graph theory, best matches - Abstract
Orthology detection is an important task within the context of genome annotation, gene nomenclature, and the understanding of gene evolution. With the rapidly accelerating pace at which new genomes become available, highly efficient methods are urgently required. As demonstrated in a large body of literature, reciprocal best match (RBH) methods are reasonably accurate and scale to large data sets. Nevertheless, they are far from perfect and prone to both false positive and false negative orthology calls. This work gives a complete characterization of best match as well as reciprocal best match graphs (BMGs and RBMGs) that arise at the first step of RBH methods. While BMGs as well as RBMGs with at most three species can be recognized in polynomial time, RBMGs with more than three species have a surprisingly complicated structure, and it remains an open problem whether there exist polynomial-time algorithms for the recognition of these RBMGs. In contrast to RBMGs, for which many (often mutually inconsistent) least resolved trees may exist, there is a unique least resolved tree for BMGs. This tree is a homeomorphic image of the true, but typically unknown, gene tree. Furthermore, in the absence of horizontal gene transfer (HGT), the reciprocal best match graph contains the orthology relation, suggesting that RBMGs can only contain false positive but no false negative orthology assignments. Simulation scenarios reveal that so-called good quartets, a certain graph pattern on four vertices in BMGs, can be used to successfully identify almost all false positive edges in RBMGs. Together with the existence of a unique least resolved tree, this suggests that BMGs contain a lot of valuable information for orthology inference that would be lost by exclusively considering RBMGs. These insights motivate the inclusion of additional BMG and RBMG editing steps in orthology detection pipelines, based on the presented theoretical results. Moreover, a workflow is introduced to infer best matches from sequence data by retrieving quartet structures from local information instead of reconstructing the whole gene tree. A crucial prerequisite for this pipeline is the choice of suitable outgroups. However, the empirical simulations also reveal that HGT events cause strong deviations of the orthology relation from the RBMG as well as good quartets that are no longer associated with false positive orthologs, suggesting the need for further investigation of the xenology relation. The directed Fitch xenology relation is characterized in terms of forbidden 3-vertex subgraphs and, moreover, a polynomial-time algorithm for the recognition and the reconstruction of a unique least resolved tree is presented. The undirected Fitch relation, in contrast, is shown to be a complete multipartite graph, which does not provide any interesting phylogenetic information. In summary, the results of this work can be used to develop new methods for inferring orthology, paralogy, and HGT. They promise major improvements in the accuracy and the computational performance of RBH-based approaches.
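A small illustrative sketch (not the thesis software) of the first step of RBH-style methods: computing best matches and reciprocal best matches between the genes of two species from a matrix of evolutionary distances. The gene names and distances are made up, and ties between equally close genes are ignored in this toy version.
```python
# Best matches and reciprocal best matches between two species from a made-up distance matrix.
import numpy as np

genes_A = ["a1", "a2"]          # genes of species A
genes_B = ["b1", "b2", "b3"]    # genes of species B
D = np.array([                   # hypothetical evolutionary distances d(a_i, b_j)
    [0.1, 0.7, 0.4],
    [0.6, 0.2, 0.25],
])

# Best match of a gene in the other species: the gene with minimal distance (ties ignored here).
best_A_to_B = {genes_A[i]: genes_B[int(np.argmin(D[i]))] for i in range(len(genes_A))}
best_B_to_A = {genes_B[j]: genes_A[int(np.argmin(D[:, j]))] for j in range(len(genes_B))}

# Reciprocal best matches: pairs that are best matches of each other (edges of the RBMG).
rbm = [(a, b) for a, b in best_A_to_B.items() if best_B_to_A[b] == a]
print("best matches A->B:", best_A_to_B)
print("reciprocal best matches:", rbm)
```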
- Published
- 2019
48. Towards Dynamic Programming on Generalized Data Structures: and Applications of Dynamic Programming in Bioinformatics
- Author
-
Berkemer, Sarah Juliane and Universität Leipzig
- Subjects
ddc:000 ,Dynamic Programming, Alignments, Generalized Data Structures, Sequencing Data - Abstract
Dynamic programming (DP) is a method for solving optimization problems. The problem is divided into overlapping subproblems, and an optimal solution is computed for each of them; these partial solutions are then combined into an overall solution. Partial solutions are stored in a table so that each of them has to be computed only once. In this way, a search space of exponential size can be explored in polynomial time and an optimal solution can be found. Dynamic programming was developed by Bellman in 1952, and one of its first applications was the detection of typing errors in programming. DP algorithms are applied frequently and in many different ways in bioinformatics, for example for the comparison of gene sequences, known as sequence alignment, or for the prediction of molecular structures. The amount of data, and with it the amount of analysis, is growing steadily, which is why new and more complex data structures are becoming ever more important. One goal is therefore to develop DP algorithms that can be applied to data structures more complex than strings. The principle of algebraic dynamic programming (ADP) allows DP algorithms to be decomposed into smaller components, which can then be further developed and modified independently of each other. The thesis is structured in two parts, the first of which contains the theoretical work on the development of dynamic programming algorithms. Here, principles and definitions of dynamic programming are presented first (Chapter 2) to ensure a better understanding of the subsequent chapters. The second part of the thesis shows different bioinformatics applications of DP algorithms to biological data. An introductory chapter (Chapter 5) presents fundamentals of biological data and algorithms that are then used in the following chapters.
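As a concrete illustration of the table-filling principle described above (a textbook example, not code from the thesis): the edit distance between two strings, computed by dynamic programming over overlapping subproblems, as in the sequence-comparison applications mentioned in the abstract.
```python
# Classic dynamic programming example: edit distance via a table of optimal sub-solutions.
def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    # dp[i][j] = minimal number of edits turning a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                              # deletion
                dp[i][j - 1] + 1,                              # insertion
                dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),     # (mis)match
            )
    return dp[n][m]

print(edit_distance("dynamic", "dynmaic"))   # 2: a transposition costs two single edits
```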
- Published
- 2019
49. Formal-Ontological Analysis of the Relationship between Data and Knowledge: A Process-Based Paradigm for Data Science
- Author
-
Siemoleit, Sebastian and Universität Leipzig
- Subjects
Artificial Intelligence, Axiomatic Reconstructions, Changing Knowledge, Formal Ontology, Philosophy of Time, Process Philosophy ,ddc:000 - Published
- 2019
50. Understanding Inconsistency -- A Contribution to the Field of Non-monotonic Reasoning
- Author
-
Ulbricht, Markus and Universität Leipzig
- Subjects
Inconsistency handling, Non-monotonic reasoning, Measuring inconsistency ,ddc:000 - Abstract
Conflicting information in an agent's knowledge base may lead to a semantical defect, that is, a situation where it is impossible to draw any plausible conclusion. Finding out the reasons for the observed inconsistency and restoring consistency in a certain minimal way are frequently occurring issues in the research area of knowledge representation and reasoning. In a seminal paper Raymond Reiter proves a duality between maximal consistent subsets of a propositional knowledge base and minimal hitting sets of each minimal conflict -- the famous hitting set duality. We extend Reiter's result to arbitrary non-monotonic logics. To this end, we develop a refined notion of inconsistency, called strong inconsistency. We show that minimal strongly inconsistent subsets play a similar role as minimal inconsistent subsets in propositional logic. In particular, the duality between hitting sets of minimal inconsistent subsets and maximal consistent subsets generalizes to arbitrary logics if the stronger notion of inconsistency is used. We cover various notions of repairs and characterize them using analogous hitting set dualities. Our analysis also includes an investigation of structural properties of knowledge bases with respect to our notions. Minimal inconsistent subsets of knowledge bases in monotonic logics play an important role when investigating the reasons for conflicts and trying to handle them, but also for inconsistency measurement. Our notion of strong inconsistency thus allows us to extend existing results to non-monotonic logics. While measuring inconsistency in propositional logic has been investigated for some time now, taking the non-monotony into account poses new challenges. In order to tackle them, we focus on the structure of minimal strongly inconsistent subsets of a knowledge base. We propose measures based on this notion and investigate their behavior in a non-monotonic setting by revisiting existing rationality postulates, and analyzing the compliance of the proposed measures with these postulates. We provide a series of first results in the context of inconsistency in abstract argumentation theory regarding the two most important reasoning modes, namely credulous as well as skeptical acceptance. Our analysis includes the following problems regarding minimal repairs: existence, verification, computation of one and characterization of all solutions. The latter will be tackled with our previously obtained duality results. Finally, we investigate the complexity of various related reasoning problems and compare our results to existing ones for monotonic logics.
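A brute-force illustration of the hitting set duality on a tiny propositional knowledge base (the classical monotonic case, not the generalized non-monotonic setting of the thesis): the complements of the maximal consistent subsets hit every minimal inconsistent subset. The formulas are invented and consistency is checked by enumerating truth assignments.
```python
# Minimal inconsistent subsets, maximal consistent subsets and the hitting set duality,
# computed by brute force for a tiny propositional knowledge base.
from itertools import chain, combinations, product

# Knowledge base over atoms p, q; each formula is a function from an assignment to a bool.
KB = {
    "p":          lambda v: v["p"],
    "not p":      lambda v: not v["p"],
    "q":          lambda v: v["q"],
    "p or not q": lambda v: v["p"] or not v["q"],
}
ATOMS = ["p", "q"]

def consistent(subset):
    """A set of formulas is consistent iff some truth assignment satisfies all of them."""
    for values in product([True, False], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if all(KB[f](v) for f in subset):
            return True
    return False

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

mis = [set(s) for s in subsets(KB) if not consistent(s)
       and all(consistent(t) for t in subsets(s) if set(t) != set(s))]
mcs = [set(s) for s in subsets(KB) if consistent(s)
       and all(not consistent(set(s) | {f}) for f in KB if f not in s)]

print("minimal inconsistent subsets:", mis)
print("maximal consistent subsets:  ", mcs)

# Duality check: each complement of a maximal consistent subset hits every minimal
# inconsistent subset.
for s in mcs:
    repair = set(KB) - s
    assert all(repair & m for m in mis)
print("every complement of an MCS hits all minimal inconsistent subsets: verified")
```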
- Published
- 2019