Descriptor: "Bioinformatik (beräkningsbiologi)" / Language: english - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bioinformatik (beräkningsbiologi)"' showing total 376 results


1. Covariance properties under natural image transformations for the generalised Gaussian derivative model for visual receptive fields

Author: Lindeberg, Tony
Subjects: vision, Bioinformatics (Computational Biology), Image and Video Processing (eess.IV), Neurosciences, Neuroscience (miscellaneous), receptive field, Galilean covariance, Electrical Engineering and Systems Science - Image and Video Processing, Cellular and Molecular Neuroscience, Datorseende och robotik (autonoma system), FOS: Biological sciences, Quantitative Biology - Neurons and Cognition, image transformations, affine covariance, FOS: Electrical engineering, electronic engineering, information engineering, scale covariance, Bioinformatik (beräkningsbiologi), theoretical neuroscience, Neurons and Cognition (q-bio.NC), primary visual cortex, Neurovetenskaper, Computer Vision and Robotics (Autonomous Systems)
Abstract: This paper presents a theory for how geometric image transformations can be handled by a first layer of linear receptive fields, in terms of true covariance properties, which, in turn, enable geometric invariance properties at higher levels in the visual hierarchy. Specifically, we develop this theory for a generalized Gaussian derivative model for visual receptive fields, which is derived in an axiomatic manner from first principles, that reflect symmetry properties of the environment, complemented by structural assumptions to guarantee internally consistent treatment of image structures over multiple spatio-temporal scales. It is shown how the studied generalized Gaussian derivative model for visual receptive fields obeys true covariance properties under spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations, implying that a vision system, based on image and video measurements in terms of the receptive fields according to this model, can to first order of approximation handle the image and video deformations between multiple views of objects delimited by smooth surfaces, as well as between multiple views of spatio-temporal events, under varying relative motions between the objects and events in the world and the observer. We conclude by describing implications of the presented theory for biological vision, regarding connections between the variabilities of the shapes of biological visual receptive fields and the variabilities of spatial and spatio-temporal image structures under natural image transformations. 38 pages, 14 figures
Published: 2023
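
The scale covariance property named in the abstract above can be made concrete for the purely spatial case. The notation below (image $f$, scale parameter $s$, scaling factor $S$) is assumed for illustration and does not come from this record; it is a sketch of the standard Gaussian scale-space relation, not the paper's full spatio-temporal derivation.

```latex
% Assumed notation: f is a 2-D image, s is the variance of the Gaussian kernel.
g(x;\, s) = \frac{1}{2\pi s}\, e^{-|x|^2/(2s)}, \qquad
L(x;\, s) = (g(\cdot;\, s) * f)(x).

% Uniform spatial scaling of the image domain by a factor S > 0:
f'(x') = f(x), \qquad x' = S\,x, \qquad s' = S^2 s
\quad\Longrightarrow\quad
L'(x';\, s') = L(x;\, s).

% Scale-normalised derivatives, \partial_\xi = s^{1/2}\,\partial_x, of total order m
% then agree at matching points and scales:
\partial_{\xi'}^{m} L'(x';\, s') = \partial_{\xi}^{m} L(x;\, s).
```

The first implication follows from the substitution $u' = S\,u$ in the convolution integral, since $g(S u;\, S^2 s) = g(u;\, s)/S^2$ in two dimensions; the second follows because $\partial_{x'} = S^{-1}\partial_x$ while $s'^{1/2} = S\, s^{1/2}$.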

2. Computational prediction of cell-cell interactions in the brain-tumour microenvironment

Author: Camargo Romera, Paula
Subjects: Bioinformatics (Computational Biology), Cancer och onkologi, Microenvironment, Bioinformatics and Systems Biology, Bioinformatics, Systems Biology, Interactions, Computational Biology, Brain tumour, Bioinformatik och systembiologi, Cancer and Oncology, Bioinformatik (beräkningsbiologi), Glioblastoma, Cancer
Abstract: Glioblastoma is the fastest-growing and most common malignant brain tumour in adults. It is normally treated with surgery and radio- or chemotherapy, but life expectancy is approximately 15 months, with a high probability of the cancer recurring. Therefore, there is a need to decrease its severity. Bulk and single-cell RNA sequencing allow the identification of cellular states in tumours affected by cell-intrinsic and extrinsic factors. Four different cellular states have been identified in glioblastoma: neural progenitor-like, oligodendrocyte progenitor-like, astrocyte-like, and mesenchymal-like. As glioblastoma is an immunosuppressive tumour, it can alter the immune system and increase the tumour's immune escape by secreting immunosuppressive factors or interacting with the brain microenvironment. Two datasets were used in this study to explore whether the localization of the tumour in the brain microenvironment and the tendency of glioblastomas to activate microglial cells are due to particular ligand-receptor interactions. Data quality control was applied to both datasets, and the SingleCellSignalR and CellphoneDB packages were used to predict the possible interactions. A total of seven experiments were designed for this study. The first dataset, GBmap, allowed a comparison between tumour cells and microglia, tumour cells and other cell types in the brain, and the four cellular states of glioblastoma with microglia and macrophages. Next, healthy microglia from GBmap were compared with the tumour bulk data from the second dataset, HGCC. The bootstrap technique was performed to compare bulk data vs single-cell data, and a comparison between tumour cells and microglia or other cell types was analysed. Results showed specific and shared interactions between cell types or cellular states, revealing that the different localization of the tumour cells depends on the expressed ligand-receptor pairs. 
Also, a total of four patterns of interactions, each with a different tendency to activate microglial cells, were found in the 50 samples; these are promising results for further exploring drugs that interfere with these interactions, or how the interactions relate to patient survival. Furthermore, even if glioblastoma is a heterogeneous disease, more interactions were predicted with microglial/macrophage cells without a uniform pattern between patients, and therefore, this study is a starting point upon which further in vitro studies would be needed to study the predicted interactions as potential targets to stop the progression of this type of cancer.
Published: 2023

3. Intersecting Graph Representation Learning and Cell Profiling : A Novel Approach to Analyzing Complex Biomedical Data

Author: Chamyani, Nima
Subjects: Bioinformatics (Computational Biology), Biological systems, Drug discovery, Graph representation learning, Bioinformatik (beräkningsbiologi), Graph neural networks (GNNs), Network medicine, Cell profiling, Machine learning techniques, Graphs, Biomarkers, Protein-Compound-Pathway interactions
Abstract: In recent biomedical research, graph representation learning and cell profiling techniques have emerged as transformative tools for analyzing high-dimensional biological data. The integration of these methods, as investigated in this study, has facilitated an enhanced understanding of complex biological systems, consequently improving drug discovery. The research aimed to decipher connections between chemical structures and cellular phenotypes while incorporating other biological information like proteins and pathways into the workflow. To achieve this, machine learning models' efficacy was examined for classification and regression tasks. The newly proposed graph-level and bio-graph integrative predictors were compared with traditional models. Results demonstrated their potential, particularly in classification tasks. Moreover, the topology of the COVID-19 BioGraph was analyzed, revealing the complex interconnections between chemicals, proteins, and biological pathways. By combining network analysis, graph representation learning, and statistical methods, the study was able to predict active chemical combinations within inactive compounds, thereby exhibiting significant potential for further investigations. Graph-based generative models were also used for molecule generation opening up further research avenues in finding lead compounds. In conclusion, this study underlines the potential of combining graph representation learning and cell profiling techniques in advancing biomedical research in drug repurposing and drug combination. This integration provides a better understanding of complex biological systems, assists in identifying therapeutic targets, and contributes to optimizing molecule generation for drug discovery. Future investigations should optimize these models and validate the drug combination discovery approach. 
As these techniques continue to evolve, they hold the potential to significantly impact the future of drug screening, drug repurposing, and drug combinations.
Published: 2023

4. Orientation selectivity of affine Gaussian derivative based receptive fields

Author: Lindeberg, Tony
Subjects: Orientation selectivity, Complex cell, Affine covariance, Bioinformatics (Computational Biology), Gaussian derivative, Vision, Quasi quadrature, Bioinformatik (beräkningsbiologi), Simple cell, Theoretical neuroscience, Receptive field
Abstract: This paper presents a theoretical analysis of the orientation selectivity of simple and complex cells that can be well modelled by the generalized Gaussian derivative model for visual receptive fields, with the purely spatial component of the receptive fields determined by oriented affine Gaussian derivatives for different orders of spatial differentiation. A detailed mathematical analysis is presented for the three different cases of either: (i) purely spatial receptive fields, (ii) space-time separable spatio-temporal receptive fields and (iii) velocity-adapted spatio-temporal receptive fields. Closed-form theoretical expressions for the orientation selectivity curves for idealized models of simple and complex cells are derived for all these main cases, and it is shown that the degree of orientation selectivity of the receptive fields increases with a scale parameter ratio $\kappa$, defined as the ratio between the scale parameters in the directions perpendicular to vs. parallel with the preferred orientation of the receptive field. It is also shown that the degree of orientation selectivity increases with the order of spatial differentiation in the underlying affine Gaussian derivative operators over the spatial domain. We conclude by describing biological implications of the derived theoretical results, demonstrating that the predictions from the presented theory are consistent with previously established biological results concerning broad vs. sharp orientation tuning of visual neurons in the primary visual cortex. We also demonstrate that the above theoretical predictions, in combination with these biological results, are consistent with a previously formulated biological hypothesis, stating that the biological receptive field shapes should span the degrees of freedom in affine image transformations, to support affine covariance over the population of receptive fields in the primary visual cortex.
Published: 2023

5. The revolutionary partnership of computation and biology

Author: Rivas-Carrillo, Salvador Daniel
Subjects: Bioinformatics (Computational Biology), Annan medicinsk grundvetenskap, Bioinformatik (beräkningsbiologi), Other Basic Medicine
Abstract: The organization of living beings is complex. Science uses modeling in order to gain a deeper understanding, and to be able to manipulate the processes of living organisms. To this purpose, I used and developed computational tools to investigate and model different relevant biological phenomena. In paper I, I utilized whole-genome data from wild and domesticated European rabbit (Oryctolagus cuniculus sp.) populations to identify segregating insertions of endogenous retroviruses and compare their variation along the host phylogeny and domestication history. The results from this study highlight the importance of genomic modeling beyond reference organisms and reference individuals, and provide deep insights regarding strategies for variant analyses in host population comparative genomics. In paper IV, I studied the process of exaptation of foreign genetic elements at broad-scale by observing the presence and characteristics of retroviral env gene, syncytin, across vertebrates. I searched a library of more than 150 chromosome-length assemblies covering 17 taxonomical orders for syncytin homologs, where I identified and syntenically aligned over 300 loci insertions, including not previously known insertions. Additionally, three-dimensional structures of the recovered sequences were predicted using AlphaFold2. Phylogenomics analyses suggest a complex dynamic of multiple retroviral insertions at different time points with sequence conservation specific to clades that share a similar histo-physiological placental type. In paper II, I expanded the scope to encompass translational medicine by developing an unsupervised machine learning methodology for detecting anomalies in biomedical signals, MindReader, which I applied primarily to electroencephalogram. In paper III, I developed a hidden Markov model implementation that includes a hypothesis generator for stream time-domain signals, which is used as a dependency for paper II. 
The work in this thesis substantiates that a combination of biological knowledge, cutting-edge technology, and robust algorithmic design constitute the primordial factors for scientific advancement.
Published: 2023

6. Predicting Dementia Risk Factors Based on Feature Selection and Neural Networks

Author: Javeed, Ashir; Moraes, Ana Luiza Dallora; Sanmartin Berglund, Johan; Ali, Arif; Anderberg, Peter; Ali, Liaqat
Subjects: Neurologi, Geriatrik, Cognitive decline, Novel methods, Detection/identification, Societal impacts, Features selection, feature selection, Deep neural networks, Diagnosis, genetic algorithm, Bioinformatics (Computational Biology), Learning systems, Neurodegenerative diseases, Neurosciences, Genetic algorithms, neural networks, Risk factors, Neurology, Geriatrics, Risk Identification, Neural-networks, Bioinformatik (beräkningsbiologi), Prediction modelling, Dementia prediction, Neurovetenskaper, Forecasting
Abstract: Dementia is a disorder with high societal impact and severe consequences for its patients, who suffer from a progressive cognitive decline that leads to increased morbidity, mortality, and disabilities. Since there is a consensus that dementia is a multifactorial disorder, which portrays changes in the brain of the affected individual as early as 15 years before its onset, prediction models that aim at its early detection and risk identification should consider these characteristics. This study presents a novel method for ten-year prediction of dementia using multifactorial data comprising 75 variables. Two automated diagnostic systems were developed that use genetic algorithms for feature selection, with an artificial neural network and a deep neural network used for dementia classification. The proposed model based on a genetic algorithm and a deep neural network achieved the best accuracy of 93.36%, sensitivity of 93.15%, specificity of 91.59%, and MCC of 0.4788, and outperformed 11 other machine learning techniques previously presented for dementia prediction. The identified best predictors were: age, past smoking habit, history of infarct, depression, hip fracture, single leg standing test with right leg, score in the physical component summary and history of TIA/RIND. The identification of risk factors is imperative in dementia research as an effort to prevent or delay its onset. CC BY 4.0. © 2023 Tech Science Press. All rights reserved. Corresponding Author: Johan Sanmartin Berglund. Email: johan.sanmartin.berglund@bth.se. The authors received no specific funding for this study.
Published: 2023
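
The abstract above reports accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC). As a minimal sketch of how these four figures relate to a binary confusion matrix, the snippet below computes them from raw counts. The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical, imbalanced counts chosen only for illustration.
metrics = binary_metrics(tp=40, fp=50, tn=900, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, accuracy can be high while MCC stays moderate, because MCC takes all four cells of the confusion matrix into account.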

7. Predicting tumour growth-driving interactions from transcriptomic data using machine learning

Author: Stigenberg, Mathilda
Subjects: immunology, Bioinformatics (Computational Biology), breast cancer, machine learning, spatial transcriptomics, Bioinformatik (beräkningsbiologi), cancer, deep learning, variational autoencoder, cell-cell interactions, tumour microenvironment, single-cell RNA sequencing, supervised variational autoencoder
Abstract: The mortality rate is high for cancer patients and treatments are only effective in a fraction of patients. To be able to cure more patients, new treatments need to be invented. Immunotherapy activates the immune system to fight against cancer and one treatment targets immune checkpoints. If more targets are found, more patients can be treated successfully. In this project, interactions between immune and cancer cells that drive tumour growth were investigated in an attempt to find new potential targets. This was achieved by creating a machine learning model that finds genes expressed in cells involved in tumour-driving interactions. Single-cell RNA sequencing and spatial transcriptomic data from breast cancer patients were utilised as well as single-cell RNA sequencing data from healthy patients. The tumour rate was based on the cumulative expression of G2/M genes. The G2/M related genes were excluded from the analysis since these were assumed to be cell cycle genes. The machine learning model was based on a supervised variational autoencoder architecture. By using this kind of architecture, it was possible to compress the input into a low dimensional space of genes, called a latent space, which was able to explain the tumour rate. The Optuna hyperparameter optimization framework was utilised to find the best combination of hyperparameters for the model. The model had an R² score of 0.93, indicating that the latent space explained 93% of the variance in the growth rate. The latent space consisted of 20 variables. To find out which genes were in this latent space, the correlation between each latent variable and each gene was calculated. The genes that were positively correlated or negatively correlated were assumed to be in the latent space and therefore involved in explaining tumour growth. Furthermore, the correlation between each latent variable and the growth rate was calculated. 
The up- and downregulated genes in each latent variable were kept and used for finding out the pathways for the different latent variables. Five of these latent variables were involved in immune responses and therefore these were further investigated. The genes in these five latent variables were mapped to cell types. One of these latent variables had upregulated immune response for positively correlated growth, indicating that immune cells were involved in promoting cancer progression. Another latent variable had downregulated immune response for negatively correlated growth. This indicated that if these genes would be upregulated instead, the tumour would be thriving. The genes found in these latent variables were analysed further. CD80, CSF1, CSF1R, IL26, IL7, IL34 and the protein NF-kappa-B were interesting finds and are known immune-modulators. These could possibly be used as markers for pro-tumour immunity. Furthermore, CSF1, CSF1R, IL26, IL34 and the protein NF-kappa-B could potentially be targeted in immunotherapy.
Published: 2023
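
The abstract above describes assigning genes to latent variables by correlating each latent variable with each gene. A minimal sketch of that step follows, assuming Pearson correlation and an arbitrary cutoff of 0.8; the record states neither the correlation measure nor the threshold, and all names and values below are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def genes_per_latent(latent, expression, threshold=0.8):
    """Group genes by strong positive or negative correlation with each latent variable.

    latent:     dict of latent-variable name -> per-cell values
    expression: dict of gene name -> per-cell expression values
    threshold:  assumed cutoff on |r|, not taken from the thesis
    """
    result = {}
    for latent_name, latent_values in latent.items():
        positive, negative = [], []
        for gene, gene_values in expression.items():
            r = pearson(latent_values, gene_values)
            if r >= threshold:
                positive.append(gene)
            elif r <= -threshold:
                negative.append(gene)
        result[latent_name] = {"positive": positive, "negative": negative}
    return result


# Toy data for five cells (hypothetical).
latent = {"z0": [0.1, 0.4, 0.5, 0.9, 1.2]}
expression = {
    "GENE_A": [1.0, 2.1, 2.4, 4.2, 5.0],  # rises with z0
    "GENE_B": [5.0, 4.1, 3.9, 2.0, 1.1],  # falls as z0 rises
    "GENE_C": [2.0, 1.0, 3.0, 1.0, 2.0],  # no clear trend
}
print(genes_per_latent(latent, expression))
```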

8. Developing a highly accurate, locally interpretable neural network for medical image analysis

Author: Ventura Caballero, Rony David
Subjects: Bioinformatics (Computational Biology), Chest radiograph, XAI, Bioinformatik (beräkningsbiologi), Interpretability, Computer vision, Pediatric pneumonia
Abstract: Background: Machine learning techniques, such as convolutional networks, have shown promise in medical image analysis, including the detection of pediatric pneumonia. However, the interpretability of these models is often lacking, compromising their trustworthiness and acceptance in medical applications. The interpretability of machine learning models in medical applications is crucial for trust and bias identification. Aim: The aim is to create a locally interpretable neural network that performs comparably to black-box models while being inherently interpretable, enhancing trust in medical machine learning models. Method: An MLP ReLU network is trained on the Guangzhou Women and Children's Medical Center pediatric chest x-ray image dataset, and the Aletheia unwrapper is used for interpretability. A 5-fold cross-validation assesses the network's performance, measuring accuracy and F1 score. The average accuracy and F1 score are 0.90 and 0.91, respectively. To assess interpretability, the results are compared against a CNN aided with LIME and SHAP to generate explanations. Results: Despite lacking convolutional layers, the MLP network satisfactorily categorizes pneumonia images, and its explanations align with relevant areas of interest from previous studies. Moreover, in a comparison with a state-of-the-art network aided with LIME and SHAP explanations, the local explanations prove to be consistent within areas of the lungs, while the post-hoc alternatives often highlighted areas not relevant for the specific task. Conclusion: The developed locally interpretable neural network demonstrates promising performance and interpretability. However, additional research and implementation are required for it to outperform the so-called black box models. In a medical setting, a more accurate model despite the score could be crucial, as it could potentially save more lives, which is the ultimate goal of healthcare.
Published: 2023
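
The evaluation protocol in the abstract above (5-fold cross-validation, averaging accuracy and F1) can be sketched in plain Python as follows. The majority-label classifier is a hypothetical stand-in for the network, the labels are toy data, and the folds are taken in order without shuffling or stratification, which a real experiment would normally add.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    return accuracy, (2 * tp / denom if denom else 0.0)


def cross_validate(features, labels, fit_predict, k=5):
    """Average accuracy and F1 over k folds; fit_predict(train_x, train_y, test_x) -> predictions."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        scores.append(accuracy_and_f1([labels[i] for i in test], predictions))
    return sum(a for a, _ in scores) / k, sum(f for _, f in scores) / k


def majority_classifier(train_x, train_y, test_x):
    """Hypothetical stand-in model: always predicts the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return [majority] * len(test_x)


features = list(range(10))               # placeholder inputs
labels = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # toy labels, 1 = pneumonia
mean_accuracy, mean_f1 = cross_validate(features, labels, majority_classifier)
print(f"accuracy={mean_accuracy:.2f}  F1={mean_f1:.2f}")
```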

9. Improved computations for relationship inference using low-coverage sequencing data

Author: Petter Mostad, Andreas Tillmar, and Daniel Kling
Subjects: Bioinformatics (Computational Biology), Structural Biology, Applied Mathematics, LcNGS, Pedigree inference, Bayesian, Bioinformatik (beräkningsbiologi), Molecular Biology, Biochemistry, Computer Science Applications
Abstract: Pedigree inference, for example determining whether two persons are second cousins or unrelated, can be done by comparing their genotypes at a selection of genetic markers. When the data for one or more of the persons is from low-coverage next generation sequencing (lcNGS), currently available computational methods either ignore genetic linkage or do not take advantage of the probabilistic nature of lcNGS data, relying instead on first estimating the genotype. We provide a method and software (see familias.name/lcNGS) bridging the above gap. Simulations indicate that our results are considerably more accurate than some previously available alternatives. Our method, utilizing a version of the Lander-Green algorithm, uses a group of symmetries to speed up calculations. This group may be of further interest in other calculations involving linked loci. Funding agencies: Chalmers University of Technology
Published: 2023
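
The abstract above contrasts methods that first estimate a genotype with methods that use the probabilistic nature of lcNGS data. A minimal sketch of what that probabilistic information can look like at a single biallelic marker is given below, assuming a simple binomial read model with a fixed per-read error rate. The model, the error rate and the function names are assumptions for illustration; this is not the model implemented in the software the record points to.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """P(reads | genotype) for genotypes carrying 0, 1 or 2 copies of the alternative allele.

    Assumes each read independently shows the alternative allele with probability
    p_alt, which mixes the true allele fraction with a symmetric read error.
    """
    n = ref_reads + alt_reads
    likelihoods = []
    for alt_copies in (0, 1, 2):
        fraction = alt_copies / 2
        p_alt = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        likelihoods.append(
            comb(n, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def normalise(values):
    """Rescale non-negative values so they sum to one (flat prior over genotypes)."""
    total = sum(values)
    return [v / total for v in values]


# A single read showing the reference allele: typical low-coverage evidence.
posterior = normalise(genotype_likelihoods(ref_reads=1, alt_reads=0))
print([round(p, 4) for p in posterior])
```

In this toy case the heterozygous genotype keeps roughly a third of the probability mass, which a hard genotype call would discard.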

10. Combining Cell Painting, Gene Expression and Structure-Activity Data for Mechanism of Action Prediction

Author: Everett Palm, Erik
Subjects: Bioinformatics (Computational Biology), machine learning, tabular data, image data, Bioinformatik (beräkningsbiologi), deep learning, bioinformatics, joint model
Abstract: The rapid progress in high-throughput omics methods and high-resolution morphological profiling, coupled with the significant advances in machine learning (ML) and deep learning (DL), has opened new avenues for tackling the notoriously difficult problem of predicting the Mechanism of Action (MoA) for a drug of clinical interest. Understanding a drug's MoA can enrich our knowledge of its biological activity, shed light on potential side effects, and serve as a predictor of clinical success. This project aimed to examine whether incorporating gene expression data from LINCS L1000 public repository into a joint model previously developed by Tian et al. (2022), which combined chemical structure and morphological profiles derived from Cell Painting, would have a synergistic effect on the model's ability to classify chemical compounds into ten well-represented MoA classes. To do this, I explored the gene expression dataset to assess its quality, volume, and limitations. I applied a variety of ML and DL methods to identify the optimal single model for MoA classification using gene expression data, with a particular emphasis on transforming tabular data into image data to harness the power of convolutional neural networks. To capitalize on the complementary information stored in different modalities, I tested end-to-end integration and soft-voting on sets of joint models across five stratified data splits. The gene expression dataset was relatively low in quality, with many uncontrollable factors that complicated MoA prediction. The highest-performing gene expression model was a one-dimensional convolutional neural network, with an average macro F1 score of 0.40877 and a standard deviation of 0.034. Approaches converting tabular data into image data did not significantly outperform other methods. Combining optimized single models resulted in a performance decline compared to the best single model in the combination. 
To take full advantage of algorithmic developments in drug development and high-throughput multi-omics data, my project underscores the need for standardizing data generation and optimizing data fusion methods.
Published: 2023
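
The abstract above mentions soft-voting over several models and a macro F1 score. The sketch below shows both on toy data: class probabilities from two hypothetical models are averaged, the most probable class is chosen, and macro F1 is the unweighted mean of per-class F1 scores. All probabilities and labels are invented for illustration.

```python
def soft_vote(prob_sets):
    """Average class probabilities across models and return the argmax class per sample.

    prob_sets: one entry per model, each a list of per-sample class-probability lists.
    """
    n_models = len(prob_sets)
    predictions = []
    for per_model in zip(*prob_sets):  # the same sample as seen by each model
        n_classes = len(per_model[0])
        mean = [sum(p[c] for p in per_model) / n_models for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Hypothetical outputs of two single-modality models for four compounds, three classes.
model_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
model_b = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.1, 0.7, 0.2]]
y_true = [0, 1, 2, 0]

y_pred = soft_vote([model_a, model_b])
print(y_pred, round(macro_f1(y_true, y_pred, n_classes=3), 4))
```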

11. Finding Genotype-Phenotype Correlations in Norway Spruce - A Genome-Wide Association Study using Machine Learning

Author: Sandberg, Matilda
Subjects: Machine Learning, Bioinformatics (Computational Biology), Naturvetenskap, Bioinformatik (beräkningsbiologi), Natural Sciences, Genome-Wide Association Study
Abstract: The Norway spruce is of great importance from both an ecological and an economic standpoint. Information about which genes cause certain phenotypic traits in the species is therefore highly valuable. The purpose of this project was to apply machine learning to find such genotype-phenotype correlations. The purpose was also to compare the results from different machine learning algorithms to a more traditional linear mixed model GWAS (where correlation to the phenotype is estimated for each SNP one by one) to find which is the better method for GWAS. The machine learning algorithms tested were decision tree, support vector machine and support vector regression. The phenotypes analyzed were wood density and initiation frequency of zygotic embryogenesis (ZE). The latter is related to a new method for cloning. The genetic data consisted of single-nucleotide polymorphisms (SNPs). Due to the large genome size of Norway spruce and limitations in the R packages used, two different approaches were taken to reduce the sample size. The first approach used Kendall's rank correlation coefficient to remove redundant SNPs and the second used an iterative approach to the machine learning model. The iterative approach proved to be the best, and support vector machine/regression was found to be better than decision tree for both phenotypes. Support vector regression from the iterative approach resulted in a squared correlation coefficient of 0.83 for density and 0.94 for ZE initiation frequency. Note that these very high values should be interpreted with caution, as it is possible that some of the significant correlations are only due to random chance. Even a small chance of random correlations will result in findings when the number of SNPs is this large (1,908,552 SNPs). The significant SNPs identified by the machine learning models were compared to SNPs identified by the linear mixed model GWAS. 
This indeed showed some overlaps of significant SNPs, which increases the credibility of my results. However, further investigation of the identified significant SNPs is needed to determine their functional mode of action. My conclusion is that using machine learning to predict phenotypic traits from SNP data can be a good choice. However, the model might not use all correlated SNPs, just enough to get a good prediction. Therefore, for the purpose of finding significant SNPs, the linear mixed model approach might be better. In other words, the method used should be determined by the purpose of the study.
Published: 2023
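
The abstract above mentions using Kendall's rank correlation coefficient to remove redundant SNPs. A minimal sketch of one way this could work is shown below: a tie-aware Kendall tau-b and a greedy filter that drops any SNP strongly rank-correlated with one already kept. The tau-b variant, the 0.9 threshold, the greedy order and the toy genotypes are all assumptions, not details from the thesis. The all-pairs comparison is only practical for small illustrations, not for millions of SNPs.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties (common with 0/1/2 genotype codes)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied in both: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0


def prune_redundant_snps(genotypes, threshold=0.9):
    """Greedily keep SNPs whose |tau-b| with every kept SNP is below the threshold.

    genotypes: dict of SNP id -> per-sample allele counts (0, 1 or 2).
    The dict order decides which member of a correlated pair survives.
    """
    kept = []
    for snp, values in genotypes.items():
        if all(abs(kendall_tau_b(values, genotypes[k])) < threshold for k in kept):
            kept.append(snp)
    return kept


# Toy genotypes for six samples (hypothetical).
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # duplicate of snp1
    "snp3": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with snp1
    "snp4": [0, 0, 1, 2, 2, 1],  # unrelated to snp1
}
print(prune_redundant_snps(genotypes))
```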

12. Learning-based prediction, representation, and multimodal registration for bioimage processing

Author: Pielawski, Nicolas
Subjects: Bioinformatics (Computational Biology), Bioimage processing, Bioinformatik (beräkningsbiologi), Deep learning, Multimodal image registration, bayesian optimization
Abstract: Microscopy and imaging are essential to understanding and exploring biology. Modern staining and imaging techniques generate large amounts of data resulting in the need for automated analysis approaches. Many earlier approaches relied on handcrafted feature extractors, while today's deep-learning-based methods open up new ways to analyze data automatically. Deep learning has become popular in bioimage processing as it can extract high-level features describing image content (Paper III). The work in this thesis explores various aspects and limitations of machine learning and deep learning with applications in biology. Learning-based methods have generalization issues on out-of-distribution data points, and methods such as uncertainty estimation (Paper II) and visual quality control (Paper V) can provide ways to mitigate those issues. Furthermore, deep learning methods often require large amounts of data during training. Here the focus is on optimizing deep learning methods to meet current computational capabilities and handle the increasing volume and size of data (Paper I). Model uncertainty and data augmentation techniques are also explored (Papers II and III). This thesis is split into chapters describing the main components of cell biology, microscopy imaging, and the mathematical and machine-learning theories to give readers an introduction to biomedical image processing. The main contributions of this thesis are deep-learning methods for reconstructing patch-based segmentation (Paper I) and pixel regression of traction force images (Paper II), followed by methods for aligning images from different sensors in a common coordinate system (named multimodal image registration) using representation learning (Paper III) and Bayesian optimization (Paper IV). Finally, the thesis introduces TissUUmaps 3, a tool for visualizing multiplexed spatial transcriptomics data (Paper V). 
These contributions provide methods and tools detailing how to apply mathematical frameworks and machine-learning theory to biology, giving us concrete tools to improve our understanding of complex biological processes.
Published: 2023

13. Molekylär mångsysslare : komplexiteten kring H3K36 metylering [Molecular jack-of-all-trades: the complexity of H3K36 methylation]

Author: Lindehell, Henrik
Subjects: Set2, Bioinformatics (Computational Biology), epigenetics, histone modifications, Histone 3.3, H3K36, Biochemistry and Molecular Biology, NSD, PIWI/piRNA biosynthesis, chromosome-specific gene regulation, Ash1, dosage compensation, post-translational modifications, Genetics, Bioinformatik (beräkningsbiologi), Drosophila, histone methylation, transposable elements, Genetik, proximity ligation assay, Biokemi och molekylärbiologi
Abstract: Post-translational modifications of histones enable differential transcriptional control of the genome between cell types and developmental stages, and in response to environmental factors. The methylation of Histone 3 Lysine 36 (H3K36) is one the most complex and well-studied histone modifications and is known to be involved in a wide range of molecular processes. Commonly associated with active genes and transcriptional elongation, H3K36 methylation also plays a key role in DNA repair, repression of cryptic transcription, and guiding additional post-translational modifications to histones, genomic DNA, and RNA. In Drosophila melanogaster, trimethylated H3K36 has also been linked to dosage compensation of the single male X chromosome as a binding substrate for the Male-Specific Lethal (MSL) complex. However, this model has been challenged by structural and biochemical studies demonstrating higher MSL complex affinity for other methylated lysines. There is an additional system of chromosome-specific gene regulation in D. melanogaster where transcription from the small heterochromatic fourth chromosome is increased by Painting of fourth (POF), a protein specifically binding nascent RNA on the fourth chromosome. The fourth chromosome is thought to have been an ancestral X chromosome that reverted into an autosome. POF mediating high transcription levels from an autosome is believed to be a remnant of an ancient sex-chromosome dosage compensation mechanism. Proximity ligation assays revealed no interaction between MSL complex components and methylated H3K36. This finding was corroborated by RNA sequencing of H3K36 methylation impaired mutants: the transcriptional output of the male X chromosome was unaffected in mutants where Lysine 36 on Histone 3 was replaced by an Arginine, abolishing methylation of this site. 
However, we found that knocking out Set2, which encodes the methyltransferase responsible for H3K36 trimethylation, significantly reduced X-linked transcription relative to autosomal transcription. This strongly suggests the existence of previously unrecognized alternate Set2 substrates. Interestingly, we also found that Ash1- and NSD-mediated methylation of H3K36 was required to maintain high expression from chromosome four. Recent studies have also implicated H3K36 methylation in the silencing of transposon activity in somatic cells. By analyzing the transcription of transposable elements and Piwi-interacting RNAs (piRNAs), we identified dimethylation of H3K36 by Set2 as the main methylation mark involved in this process and showed that dual-stranded piRNA clusters are preferentially activated upon disturbing the methylation machinery. These findings extend the long list of processes dependent on functional H3K36 methylation.
Published: 2023

14. Discovery of Chemical Probes through Structure-based Virtual Screening of Vast Compound Databases

Author: Luttens, Andreas
Subjects: Bioinformatics (Computational Biology), Bioinformatik (beräkningsbiologi)
Abstract: Bioactive molecules have traditionally been discovered through labor-intensive screening methods in which individual compounds are tested against specific protein targets or cells to identify those that produce the desired biological effect. However, these approaches have significant limitations. Firstly, the number of molecules that can be tested in a standard laboratory is restricted, and the acquisition and curation of these compounds come at a high cost. Secondly, these methods are time-consuming because each compound must be tested individually, and they are confined to small libraries with very limited chemical space coverage. In contrast, structure-based virtual screening can rapidly predict a molecule's interaction with a target protein, allowing for the evaluation of enormous libraries of chemical substances. Furthermore, this approach is not restricted to physically available molecules and can be extended to virtual compounds. Commercial chemical space has recently grown exponentially and currently contains several billion molecules that can be readily synthesized and delivered for experimental testing within weeks. Despite the enormous potential of these databases for drug discovery, they also pose new challenges, and development of effective strategies is required to explore ultralarge libraries. The goal of this thesis was to develop and apply novel strategies focused on exploring the potential of ultralarge chemical libraries using structure-based virtual screening. Publication I summarizes best practices on large-scale virtual screening and benchmarking protocols for molecular docking calculations. Publication II describes a docking screen of several hundred million lead-like molecules against the SARS-CoV-2 main protease, leading to promising starting points for development of coronavirus inhibitors. The binding modes predicted by docking were confirmed experimentally by X-ray crystallography. 
After several rounds of optimization, nanomolar broad-spectrum inhibitors with antiviral effects against coronaviruses in cell models were discovered. Manuscript III demonstrates how machine learning can be used to accelerate virtual screening campaigns. Classification models were trained on docking scores to identify promising molecules in ultralarge libraries relevant to the protein target of interest. The classification algorithms were able to reduce a multi-billion-scale library to a subset of high-confidence candidates with improved docking scores. Manuscript IV focuses on large-scale fragment docking to identify compounds binding to 8-oxoguanine glycosylase 1 and how to efficiently optimize them to potent inhibitors. The docking scoring function was able to correctly predict binding modes of the experimental hits and optimization led to submicromolar inhibitors with anti-inflammatory and anti-cancer effects in cell models. Publication V presents how docking of tailored virtual libraries of nature-inspired macrocycles led to potent disruptors of the KEAP1-Nrf2 complex. The results of this thesis highlight that large-scale virtual screening is a resourceful tool to discover ligands of a wide variety of drug targets.
Published: 2023

15. Compressing network populations with modal networks reveals structural diversity

Author: Kirkley, Alec, Rojas, Alexis, Rosvall, Martin, and Young, Jean-Gabriel
Subjects: Bioinformatics (Computational Biology), Bioinformatik (beräkningsbiologi)
Abstract: Analyzing relational data consisting of multiple samples or layers involves critical challenges: How many networks are required to capture the variety of structures in the data? And what are the structures of these representative networks? We describe efficient nonparametric methods derived from the minimum description length principle to construct the network representations automatically. The methods input a population of networks or a multilayer network measured on a fixed set of nodes and output a small set of representative networks together with an assignment of each network sample or layer to one of the representative networks. We identify the representative networks and assign network samples to them with an efficient Monte Carlo scheme that minimizes our description length objective. For temporally ordered networks, we use a polynomial time dynamic programming approach that restricts the clusters of network layers to be temporally contiguous. These methods recover planted heterogeneity in synthetic network populations and identify essential structural heterogeneities in global trade and fossil record networks. Our methods are principled, scalable, parameter-free, and accommodate a wide range of data, providing a unified lens for exploratory analyses and preprocessing large sets of network samples.
Published: 2023
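
The temporally contiguous clustering mentioned in the abstract above can be illustrated with a generic segmentation dynamic program. This is a minimal sketch, not the authors' method: `contiguous_clusters` and `toy_cost` are hypothetical names, and the toy cost (a fixed per-cluster penalty plus edge mismatches against a majority-vote "modal" edge set) only stands in for the description-length objective the abstract refers to.

```python
from typing import Callable, List, Set, Tuple

Edge = Tuple[int, int]


def contiguous_clusters(
    n_layers: int,
    segment_cost: Callable[[int, int], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split layers 0..n_layers-1 into contiguous segments of minimal total cost.

    segment_cost(i, j) is the cost of one cluster holding layers i..j-1.
    The double loop makes O(n_layers^2) calls to segment_cost.
    """
    best = [0.0] + [float("inf")] * n_layers  # best[j]: optimal cost of the first j layers
    cut = [0] * (n_layers + 1)                # cut[j]: start index of the last segment
    for j in range(1, n_layers + 1):
        for i in range(j):
            cost = best[i] + segment_cost(i, j)
            if cost < best[j]:
                best[j], cut[j] = cost, i
    # Walk the stored cut points backwards to recover the segments.
    segments = []
    j = n_layers
    while j > 0:
        segments.append((cut[j], j))
        j = cut[j]
    return best[n_layers], segments[::-1]


def toy_cost(layers: List[Set[Edge]], penalty: float) -> Callable[[int, int], float]:
    """Stand-in cost (an assumption): per-cluster penalty plus edge mismatches
    between each layer and the majority-vote edge set of its cluster."""
    def cost(i: int, j: int) -> float:
        block = layers[i:j]
        mismatches = 0
        for edge in set().union(*block):
            present = sum(edge in layer for layer in block)
            mismatches += min(present, len(block) - present)
        return penalty + mismatches
    return cost


# Four layers on the same node set; the structure changes after the second layer.
layers = [{(0, 1), (1, 2)}, {(0, 1), (1, 2)}, {(2, 3)}, {(2, 3), (3, 4)}]
total, segments = contiguous_clusters(len(layers), toy_cost(layers, penalty=1.5))
print(total, segments)  # 4.0 [(0, 2), (2, 4)]
```

Each segment `(i, j)` covers layers `i` through `j - 1`. The per-cluster penalty plays the role of a model-complexity term: raising it favours fewer, longer clusters.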

16. Decision Support System for Predicting Mortality in Cardiac Patients Based on Machine Learning

Author: Ashir Javeed, Muhammad Asim Saleem, Ana Luiza Dallora, Liaqat Ali, Johan Sanmartin Berglund, and Peter Anderberg
Subjects: Fluid Flow and Transfer Processes, Bioinformatics (Computational Biology), Kardiologi, Process Chemistry and Technology, General Engineering, heart mortality, Public Health, Global Health, Social Medicine and Epidemiology, feature ranking, Computer Science Applications, Folkhälsovetenskap, global hälsa, socialmedicin och epidemiologi, Datorsystem, imbalance classes, Computer Systems, Bioinformatik (beräkningsbiologi), General Materials Science, Cardiac and Cardiovascular Systems, Instrumentation, random forest
Abstract: Researchers have proposed several automated diagnostic systems based on machine learning and data mining techniques to predict heart failure. However, researchers have not paid close attention to predicting cardiac patient mortality. We developed a clinical decision support system for predicting mortality in cardiac patients to address this problem. The dataset collected for the experimental purposes of the proposed model consisted of 55 features with a total of 368 samples. We found that the classes in the dataset were highly imbalanced. To avoid the problem of bias in the machine learning model, we used the synthetic minority oversampling technique (SMOTE). After balancing the classes in the dataset, the newly proposed system employed a (Formula presented.) statistical model to rank the features from the dataset. The highest-ranked features were fed into an optimized random forest (RF) model for classification. The hyperparameters of the RF classifier were optimized using a grid search algorithm. The performance of the newly proposed model ((Formula presented.)_RF) was validated using several evaluation measures, including accuracy, sensitivity, specificity, F1 score, and a receiver operating characteristic (ROC) curve. With only 10 features from the dataset, the proposed model (Formula presented.)_RF achieved the highest accuracy of 94.59%. The proposed model (Formula presented.)_RF improved the performance of the standard RF model by 5.5%. Moreover, the proposed model (Formula presented.)_RF was compared with other state-of-the-art machine learning models. The experimental results show that the newly proposed decision support system outperforms the other machine learning systems using the same feature selection module ((Formula presented.)). CC BY 4.0 © 2023 by the authors. (This article belongs to the Special Issue Opinion Mining and Sentiment Analysis Using Deep Neural Network.) This research received no external funding.
Published: 2023
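
The class-balancing step named in the abstract above, SMOTE, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal standard-library sketch of that interpolation idea only. The function name `smote_sketch`, its parameters, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def smote_sketch(
    minority: Sequence[Point], n_new: int, k: int = 2, seed: int = 0
) -> List[Point]:
    """Generate n_new synthetic minority samples by linear interpolation.

    For each new sample: pick a minority point x, pick one of its k nearest
    minority neighbours nb, and return x + u * (nb - x) with u drawn from [0, 1).
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        x = minority[idx]
        others = [p for i, p in enumerate(minority) if i != idx]
        # Nearest neighbours by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        nb = rng.choice(others[:k])
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority class with two features (values are made up).
minority_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote_sketch(minority_points, n_new=6)
print(len(new_points))  # 6
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.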

17. Machine Learning methods in shotgun proteomics

Author: Truong, Patrick
Subjects: benchmark mathematical methods, Bioinformatics (Computational Biology), proteomics, computational proteomics, probabilistic modelling, ms2 intensity, Bioinformatik (beräkningsbiologi), transformers, bioinformatics, mass spectrometry, protein summarization, Bayesian hierarchical modelling, label-free quantification, data-independent acquisition mass spectrometry, bert
Abstract: As high-throughput biology experiments generate increasing amounts of data, the field is naturally turning to data-driven methods for the analysis and extraction of novel insights. These insights into biological systems are crucial for understanding disease progression, drug targets, treatment development, and diagnostic methods, ultimately leading to improved human health and well-being as well as deeper insight into cellular biology. Biological data sources such as the genome, transcriptome, proteome, metabolome, and metagenome provide critical information about biological system structure, function, and dynamics. The focus of this licentiate thesis is on proteomics, the study of proteins, which is a natural starting point for understanding biological functions as proteins are crucial functional components of cells. Proteins play a crucial role in enzymatic reactions, structural support, transport, storage, cell signaling, and immune system function. In addition, proteomics has vast data repositories, and technical and methodological improvements are continually being made to yield even more data. However, generating proteomic data involves multiple steps, which are prone to errors, making sophisticated models essential to handle technical and biological artifacts and account for uncertainty in the data. In this licentiate thesis, the use of machine learning and probabilistic methods to extract information from mass-spectrometry-based proteomic data is investigated. The thesis starts with an introduction to proteomics, including a basic biological background, followed by a description of how mass-spectrometry-based proteomics experiments are performed, and challenges in proteomic data analysis. The statistics of proteomic data analysis are also explored, and state-of-the-art software and tools related to each step of the proteomics data analysis pipeline are presented.
The thesis concludes with a discussion of future work and the presentation of two original research works. The first research work focuses on adapting Triqler, a probabilistic graphical model for protein quantification developed for data-dependent acquisition (DDA) data, to data-independent acquisition (DIA) data. Challenges in this study included verifying that DIA data conformed with the model used in Triqler, addressing benchmarking issues, and modifying the missing value model used by Triqler to adapt for DIA data. The study showed that DIA data conformed with the properties required by Triqler, implemented a protein inference harmonization strategy, and modified the missing value model to adapt for DIA data. The study concluded by showing that Triqler outperformed current protein quantification techniques. The second research work focused on developing a novel deep-learning-based MS2-intensity predictor by incorporating the transformer self-attention mechanism into Prosit, an established recurrent neural network (RNN)-based deep learning framework for MS2 spectrum intensity prediction. RNNs are a type of neural network that processes sequential data step by step, capturing information from previous steps. The transformer self-attention mechanism allows a model to attend to different parts of its input sequence independently during processing, enabling it to capture dependencies and relationships between elements more effectively. Because transformers remedy some of the drawbacks of RNNs, we hypothesized that implementing the MS2-intensity predictor with transformers rather than RNNs would improve its performance. Hence, Prosit-transformer was developed, and the study showed that both the model training time and the similarity between the predicted and observed MS2 spectra improved.
These original research works address various challenges in computational proteomics and contribute to the development of data-driven life science. As high-throughput experiments generate ever larger amounts of data, the field is naturally turning to data-driven methods for analysis and the extraction of new insights. These insights into biological systems are crucial for understanding disease progression, drug effects, treatment development, and diagnostic methods, which ultimately leads to improved human health and well-being as well as a deeper understanding of cell biology. Biological data sources such as the genome, transcriptome, proteome, metabolome, and metagenome provide critical information about the structure, function, and dynamics of biological systems. The focus of this licentiate thesis is proteomics, the study of proteins, which is a natural starting point for understanding biological functions because proteins are crucial functional components of cells. These proteins play a crucial role in enzymatic reactions, structural support, transport, storage, cell signaling, and immune system function. In addition, proteomics has large data repositories, and technical and methodological improvements are continually being made to yield even more data. However, generating proteomic data requires multiple error-prone steps, which makes sophisticated models essential for handling technical and biological artifacts and for accounting for uncertainty in the data. In this licentiate thesis, the use of machine learning and probabilistic methods to extract information from mass-spectrometry-based proteomics data is investigated. The thesis begins with an introduction to proteomics, including a basic biological background, followed by a description of how mass-spectrometry-based proteomics experiments are performed and of the challenges in proteomic data analysis.
Statistical methods for proteomic data analysis are also explored, and state-of-the-art software and tools related to each step of the proteomics data analysis pipeline are presented. The thesis concludes with a discussion of future work and the presentation of two original research works. The first research work focuses on adapting Triqler, a probabilistic graphical model for protein quantification developed for data-dependent acquisition (DDA) data, to data-independent acquisition (DIA) data. The challenges in this study included verifying that the properties of DIA data conformed with the model used in Triqler, addressing benchmarking issues, and adapting the missing-value model used by Triqler to DIA data. The study showed that DIA data conformed with the properties required by Triqler, implemented a protein inference harmonization strategy, and adapted the missing-value model to DIA data. The study concluded by showing that Triqler outperformed current state-of-the-art protein quantification methods. The second research work focused on developing a deep-learning-based MS2-intensity predictor by incorporating the self-attention mechanism known as the transformer into Prosit, an established recurrent neural network (RNN)-based deep learning framework for MS2 spectrum intensity prediction. RNNs are a type of neural network that can efficiently process sequential data by retaining and using hidden states that capture information from previous steps in a sequential manner. The self-attention mechanism in the transformer allows the model to focus on different parts of the sequential data simultaneously and independently during processing, which makes it possible to capture relationships between the elements more effectively.
In this way, the transformer remedies some drawbacks of RNNs, and we therefore hypothesized that implementing a new MS2-intensity predictor with transformers instead of RNNs would improve performance. Prosit-transformer was thus constructed, and the study showed that both the model training time and the similarity between the predicted and observed MS2 spectra improved. These original research works address various challenges in computational proteomics and contribute to the development of data-driven life science. QC 2023-05-22
Published: 2023

18. Predicting morphological effect of compounds on COVID-19 infected cells

Author: Öhrner, Viktor
Subjects: Bioinformatics (Computational Biology), QSAR, AI, Bioinformatik (beräkningsbiologi), COVID-19, bioinformatics, morphological profiles, ML
Abstract: The cost of developing new drugs is high, and the aim of computer-assisted drug discovery is to reduce that development cost, either through virtual screening or by generating novel compounds. Systems biology is one approach to drug discovery in which the response of a biological system, rather than the drug-target interaction, is the subject of study. One way to observe a biological system is through microscopy images taken of cells perturbed with compounds. Image software extracts information called morphological profiles from the images, which can be used for data-hungry models. One of the ways artificial intelligence has been applied to drug discovery is with generative models that can generate new compounds. One such generative approach is reinforcement learning, which employs a critic to guide the generation of compounds towards desirable behaviors. In this study, different machine learning models were tested to see whether they could predict the morphological response of COVID-19 infected cells to compounds from the compounds' structure. No models showed promising results. The reason no model performed well lies in the dataset. There is a lot of variance in the dataset, meaning that the response to the same compound varies. The compounds in the dataset also differ greatly from one another, meaning that any representation the model learns does not transfer to other compounds. The dataset was also imbalanced, with more inactive compounds.
Published: 2023

19. Developing a reproducible bioinformatics workflow for canine inherited retinal disease

Author: Martin, Melina Toni Marie
Subjects: CanFam4, Nextflow, Bioinformatics (Computational Biology), whole-genome sequencing, dog, Bioinformatik (beräkningsbiologi), Inherited Retinal Degeneration, bioinformatics
Abstract: Inherited Retinal Degenerations (IRDs) are a heterogeneous group of diseases which lead to vision impairment and can be found both in humans and in dogs. About 1 in 1,380 humans is estimated to suffer from an autosomal recessive IRD, which would be 5.5 million people worldwide, and many more are estimated to be unaffected carriers. This makes autosomal recessive IRDs likely the most common group of Mendelian diseases in humans. Today, about 300 genetic mutations have been linked to retinal diseases in humans. Whilst in dogs only 32 genes have been identified, numerous eye conditions have been described where the genetic cause has not yet been identified. This suggests that there are many more genetic causes to discover in the dog genome. Additionally, the dog serves well as a model organism to investigate IRDs as it shares morphological and genetic similarities with humans. For these reasons, proper software, a high-quality canine reference genome, and smart implementation of bioinformatic tools and methods are a big advantage: they increase the chances of finding new causative genetic variants and, through early diagnosis, make it possible to find ways of preventing the disease, or at least alleviating its symptoms, sooner. In this project, a pre-existing pipeline consisting of Bash scripts was improved stepwise with the goal of increasing its efficiency. After checking in a first step that previous data could still be reproduced with the old pipeline, the software was updated to newer versions in a second step. A main change was the replacement of the mapping tool Burrows-Wheeler Aligner (BWA) from bwa mem to bwa-mem2 mem, and the update of the deprecated Genome Analysis Toolkit (GATK) 3.7 to version 4.3 or 4.4. Thirdly, the scripts were adapted from using the older canine reference genome CanFam3.1 to CanFam4.
In a fourth step, to automate the pipeline and shorten its running time, the pipeline steps were implemented in the workflow management system Nextflow. This step also partly aimed to bring the pipeline into accordance with the FAIR principles. All steps were tested on the same test data set, a Labrador retriever family trio in which a previous study had detected one genetic cause of a canine form of the IRD Stargardt disease, namely an insertion in the ABCA4 gene. Lastly, the workflow was also tested on a second data set from two sibling pairs of Chinese Crested Dogs (CCR) with a novel IRD of unknown genetic origin. Changing the mapping tool gave similar results. Introducing the new reference genome CanFam4 lowered the average coverage by one read, while other results were similar. Using the new reference genome increased the number of unknown variants compared with CanFam3.1. However, the known causative variant for the canine form of Stargardt disease, an insertion in the ABCA4 gene, was found in all cases. The run with Nextflow produced results identical to those obtained when the respective steps were run with Bash scripts, but it reduced the running time. Running the workflow on the new data set (CCR), followed by annotation and filtering, indicated new candidate variants that could be further investigated as a potential cause of this IRD of currently unknown origin.
Published: 2023

20. Automatiserad kvantitativ fasavbildning : för övervakning av bakteriecellstillväxt under kemisk påverkan

Author: Zargani, Samuel
Subjects: Bioinformatics (Computational Biology), image analysis, QPI, Bildanalys, Bioinformatik (beräkningsbiologi), QLSI, Quantitative Phase Imaging, kvantitativ fasavbildning
Abstract: Bacteria have a vital role in ecosystems and in human health. However, the effects of environmental pollution on them at the single-cell level are still not well understood. Label-free imaging has become increasingly important within science. It has allowed for the visualization of cells without the use of molecular tools or labels. Quadriwave lateral shearing interferometry (QLSI) is a label-free method that requires only short exposure of cells to light. QLSI is a technique within quantitative phase imaging (QPI), and the method records and stores information about the optical phase delay (OPD) as pixel values within an interferogram. These interferograms can be used to extract features of biological significance, such as dry mass. Here, an automated image analysis pipeline was developed to convert interferometric images captured on a fully automated microscope with the NIS software and a commercial SID4 sC8 camera (from Phasics). The pipeline segments regions of interest (ROI) from the OPD images and calculates various parameters of biological significance, such as dry mass. This pipeline enables the quantification of dry-mass changes among individual cells at multiple positions. The developed image analysis pipeline provides a powerful tool for studying bacteria at the single-cell level and has significant implications for pharmacology, ecotoxicology, and engineering.
Published: 2023
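
The dry-mass calculation mentioned in the abstract above is commonly done by integrating the optical phase delay over a segmented cell and dividing by the specific refraction increment. A minimal sketch of that relation follows. It is not the thesis pipeline: the function name `dry_mass_pg`, the default increment of 0.18 µm³/pg (a typical literature value for cellular material), and the toy image are all assumptions.

```python
from typing import Sequence


def dry_mass_pg(
    opd_um: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    pixel_area_um2: float,
    alpha_um3_per_pg: float = 0.18,  # assumed specific refraction increment
) -> float:
    """Dry mass in picograms of the masked region of an OPD image.

    Implements m = (1 / alpha) * sum(OPD over ROI pixels) * pixel_area, with
    OPD in micrometres and pixel area in square micrometres.
    """
    opd_sum = sum(
        value
        for opd_row, mask_row in zip(opd_um, mask)
        for value, inside in zip(opd_row, mask_row)
        if inside
    )
    return opd_sum * pixel_area_um2 / alpha_um3_per_pg


# Toy 2x2 OPD image (micrometres) with a three-pixel region of interest.
opd = [[0.00, 0.02], [0.03, 0.04]]
roi = [[False, True], [True, True]]
mass = dry_mass_pg(opd, roi, pixel_area_um2=0.01)
print(f"{mass:.4f} pg")
```

Tracking this value per segmented cell over successive frames gives the dry-mass change that the pipeline quantifies.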

21. Developing Automated Cell Segmentation Models Intended for MERFISH Analysis of the Cardiac Tissue by Deploying Supervised Machine Learning Algorithms

Author: Rune, Julia
Subjects: Heart Failure, Bioinformatics (Computational Biology), Kardiologi, Bioinformatics and Systems Biology, Cell- och molekylärbiologi, Bioinformatik och systembiologi, Cell Segmentation, Cellsegmentering, Bioinformatik (beräkningsbiologi), Cardiac and Cardiovascular Systems, Supervised Machine Learning, Övervakad Maskininlärning, Cellpose, MERFISH, Hjärtsvikt, Cell and Molecular Biology
Abstract: The following study addresses the development of automated cell segmentation models intended to identify boundaries between cells in cardiac tissue. The aim is to enable analysis of data generated by multiplexed error-robust in situ hybridization (MERFISH). MERFISH is a spatial transcriptomics technique that, unlike for example single-cell RNA sequencing (ScRNA-seq) and single molecule fluorescence in situ hybridization (smFISH), enables profiling of hundreds of RNA sequences in individual cells without losing their spatial context. In the Kosuri laboratory at the Salk Institute for Biological Studies in San Diego, MERFISH is applied to mouse hearts. The aim is to gain deeper insight into how cells are organized in healthy hearts and how this structure changes with aging and disease. Extracting meaningful information from MERFISH, however, poses a significant challenge: accurate cell segmentation. The study consequently contributes to the development of segmentation models to overcome the challenges that stand in the way of all subsequent analysis. Because classical segmentation algorithms are insufficient for segmenting the complex tissue that makes up the heart, some of today's most advanced and prominent machine learning algorithms in the field, Cellpose and Omnipose, were applied. Given the dense and heterogeneous cardiac tissue, which stems from a broad distribution of cell types and geometries, two separate models were developed: one to cover both smaller cells and cross-sectioned cardiomyocytes, and one to segment only cardiomyocytes cut in the longitudinal direction. The former model was developed and trained in Cellpose and achieved an accuracy of 91.2%. The model for longitudinal cardiomyocytes was instead developed in both Cellpose and Omnipose to evaluate which network is best suited for the purpose.
Neither network achieved a sufficiently high accuracy to be applicable, and both therefore need further training. The model generated in Omnipose is nevertheless judged the most promising, given its more comprehensive segmentation. Further areas of development include segmenting cells in fibrosis-dense regions and developing a 3D segmentation of the whole heart to achieve a more complete MERFISH analysis. In summary, the generated segmentation models have paved the way for a rigorous MERFISH analysis of the heart. By revealing some of the structural and functional causes of heart failure at a cellular level, we can thus in the long run contribute to the development of more effective therapeutic strategies. The following study delves into the development of automated cell segmentation models, with the intention of identifying boundaries between cells in the cardiac tissue for analysing spatial transcriptomics data. Addressing the limitations of alternative techniques like single-cell RNA sequencing (ScRNA-seq) and single molecule fluorescence in situ hybridization (smFISH), the study underscores the innovative use of multiplexed error-robust fluorescence in situ hybridization (MERFISH) deployed by the Kosuri Lab at Salk Institute for Biological Studies. This advanced imaging-based technique allows for a single-cell transcriptome profiling of hundreds of different transcripts while retaining the spatial context of the tissue. The technique can accordingly reveal how the organization of cells within a healthy heart is altered during disease. However, the extraction of meaningful data from MERFISH poses a significant challenge: accurate cell segmentation. This thesis therefore presents the development of a robust model for cell boundary identification within cardiac tissue, leveraging some of the advanced supervised machine learning algorithms in the field, named Cellpose and Omnipose.
Due to the dense and highly heterogeneous tissue, which stems from a wide distribution of cell types and shapes, two separate models had to be developed: one that covers the smaller cells and the cross-sectioned cardiomyocytes, and one that covers the longitudinal cardiomyocytes. The cross-section model was successfully developed to achieve an accuracy of 91.2%, whereas the longitudinal model still needs further improvements before being implemented. The thesis acknowledges potential areas for improvement, emphasizing the need to further improve the segmentation of longitudinal cardiomyocytes, tackle the challenges of segmenting cells within fibrotic regions of the diseased heart, and achieve a precise 3D cell segmentation. Nonetheless, the generated models have paved the way towards enabling efficient downstream MERFISH analysis to ultimately understand the structural and functional dynamics of heart failure at a cellular level, aiding the development of more effective therapeutic strategies.
Published: 2023

22. Covariance properties under natural image transformations for the generalized Gaussian derivative model for visual receptive fields

Author: Lindeberg, Tony
Subjects: vision, Bioinformatics (Computational Biology), Neurosciences, receptive field, Galilean covariance, theoretical biology, lateral geniculate nucleus, Datorseende och robotik (autonoma system), image transformations, affine covariance, scale covariance, Bioinformatik (beräkningsbiologi), theoretical neuroscience, primary visual cortex, Neurovetenskaper, Computer Vision and Robotics (Autonomous Systems)
Abstract: The property of covariance, also referred to as equivariance, means that an image operator is well-behaved under image transformations, in the sense that the result of applying the image operator to a transformed input image gives essentially a similar result as applying the same image transformation to the output of applying the image operator to the original image. This paper presents a theory of geometric covariance properties in vision, developed for a generalized Gaussian derivative model of receptive fields in the primary visual cortex and the lateral geniculate nucleus, which, in turn, enable geometric invariance properties at higher levels in the visual hierarchy. It is shown how the studied generalized Gaussian derivative model for visual receptive fields obeys true covariance properties under spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations. These covariance properties imply that a vision system, based on image and video measurements in terms of the receptive fields according to the generalized Gaussian derivative model, can, to first order of approximation, handle the image and video deformations between multiple views of objects delimited by smooth surfaces, as well as between multiple views of spatio-temporal events, under varying relative motions between the objects and events in the world and the observer. We conclude by describing implications of the presented theory for biological vision, regarding connections between the variabilities of the shapes of biological visual receptive fields and the variabilities of spatial and spatio-temporal image structures under natural image transformations. 
Specifically, we formulate experimentally testable biological hypotheses as well as needs for measuring population statistics of receptive field characteristics, originating from predictions from the presented theory, concerning the extent to which the shapes of the biological receptive fields in the primary visual cortex span the variabilities of spatial and spatio-temporal image structures induced by natural image transformations, based on geometric covariance properties.
Published: 2023
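
The covariance (equivariance) property that the abstract of record 22 states in words can be written compactly. The symbols are notation chosen here for illustration and are not taken from the record: $\mathcal{A}$ is the image operator, $\mathcal{T}$ the image transformation, and $f$ the input image.

```latex
\mathcal{A}(\mathcal{T} f) = \mathcal{T}(\mathcal{A} f)
```

Invariance, by contrast, would mean $\mathcal{A}(\mathcal{T} f) = \mathcal{A} f$, so that the transformation leaves the output unchanged.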

23. Modelling Large Protein Complexes

Author: Chim, Ho Yeung
Subjects: Protein Modeling, Protein Complex, Bioinformatics (Computational Biology), Monte Carlo Tree Search, Bioinformatik (beräkningsbiologi)
Abstract: AlphaFold [Jumper et al., 2021, Evans et al., 2022] is a deep learning-based method that can accurately predict the structure of single- and multiple-chain proteins. However, its accuracy decreases with an increasing number of chains, and GPU memory limits the size of protein complexes that can be predicted. Recently, Elofsson’s group introduced a Monte Carlo tree search method, MoLPC, that can predict the structure of large complexes from predictions of sub-components [Bryant et al., 2022b]. However, MoLPC cannot adjust for errors in the sub-component predictions and requires knowledge of the correct protein stoichiometry. Large protein complexes are responsible for many essential cellular processes, such as mRNA splicing [Will and Lührmann, 2011], protein degradation [Tanaka, 2009], and protein folding [Ditzel et al., 1998]. However, the lack of structural knowledge of many large protein complexes remains a challenge. Only a fraction of the eukaryotic core complexes in CORUM [Giurgiu et al., 2019] have homologous structures covering all chains in PDB, indicating a significant gap in our structural understanding of protein complexes. AlphaFold-Multimer [Evans et al., 2022] is the only deep learning method that can predict the structure of more than two protein chains, trained on proteins of up to 20 chains, and can predict complexes of up to a few thousand residues, where memory limitations come into play. Another approach, MoLPC, is to predict the structure of sub-components of large complexes and assemble them. It has been shown that it is possible to assemble large complexes from dimers manually [Burke et al., 2021] or by using Monte Carlo tree search [Bryant et al., 2022b]. One limitation of the previous MoLPC approach is its inability to account for errors in sub-component prediction. The addition of small errors in each sub-component can propagate to a significant error when building the entire complex, leading to MoLPC’s failure.
To overcome this challenge, the Monte Carlo tree search algorithm in MoLPC2 is enhanced to assemble protein complexes while simultaneously predicting their stoichiometry. Using MoLPC2, we accurately predicted the structures of 50 out of 175 non-redundant protein complexes (TM-score >0.8), while MoLPC only predicted 30. It should be noted that improvements introduced in AlphaFold version 2.3 enable the prediction of larger complexes, and if stoichiometry is known, it can accurately predict the structures of 74 complexes. Our findings suggest that assembling symmetrical complexes from sub-components results in higher accuracy while assembling asymmetrical complexes remains challenging.
Published: 2023

24. Predicting and classifying atrial fibrillation from ECG recordings using machine learning

Author: Bogstedt, Carl
Subjects: Machine Learning, Bioinformatics (Computational Biology), Bioinformatics, Atrial Fibrillation, Bioinformatik (beräkningsbiologi), Rough Sets, Classification, Electrocardiogram, XGBoost
Abstract: Atrial fibrillation is one of the most common types of heart arrhythmias, which can cause irregular, weak and fast atrial contractions up to 600 beats per minute. Atrial fibrillation has increased prevalence with age and is associated with increased risks of ischemia, as blood clots can form due to the weak contractions. During prolonged periods of atrial fibrillation, the atria can undergo a process called atrial remodelling. This causes electrophysiological and structural changes to the atria such as increased atrial size and changes to calcium ion densities. These changes themselves promote the initiation and propagation of atrial fibrillation, which makes early detection crucial. Fortunately, atrial fibrillation can be detected on an electrocardiogram. Electrocardiograms measure the electrical activity of the heart during its cardiac cycle. This includes the initiation of the action potential, the depolarization of the atria and ventricles and their repolarization. On the electrocardiogram recording, these are seen as peaks and valleys, where each peak and valley can be traced back to one of these events. This means that during atrial fibrillation, the weak, irregular and fast atrial contractions can all be detected and measured. The aim of this project was to develop a machine learning model that could predict onset of atrial fibrillation, and that could classify ongoing atrial fibrillation. This was achieved by training one multiclass classification machine learning model using XGBoost, and three binary classification machine learning models using ROSETTA, on electrocardiogram recordings of people with and without atrial fibrillation. XGBoost is a tree boosting system which uses tree-like structures to classify data, while ROSETTA is a rule-based classification model which creates rules in an IF and THEN format to make decisions.
The recordings were labelled according to three different classes: no atrial fibrillation, atrial fibrillation or preceding atrial fibrillation. The XGBoost model had a prediction accuracy of 99.3%, outperforming the three ROSETTA models and other atrial fibrillation classification and prediction models found. The ROSETTA models had high accuracies on the learning set; however, their predictions were subpar, indicating faulty settings for this type of data. The results in this project indicate that the models created can be used to accurately classify and predict onset of and ongoing atrial fibrillation, serving as a tool for early detection and verification of diagnosis.
Published: 2023

25. KIF-Key Interactions Finder : A program to identify the key molecular interactions that regulate protein conformational changes

Author: Rory M. Crean, Joanna S. G. Slusky, Peter M. Kasson, and Shina Caroline Lynn Kamerlin
Subjects: Bioinformatics (Computational Biology), Biochemistry and Molecular Biology, Bioinformatik (beräkningsbiologi), General Physics and Astronomy, Physical and Theoretical Chemistry, Biokemi och molekylärbiologi
Abstract: Simulation datasets of proteins (e.g., those generated by molecular dynamics simulations) are filled with information about how a non-covalent interaction network within a protein regulates the conformation and, thus, function of the said protein. Most proteins contain thousands of non-covalent interactions, with most of these being largely irrelevant to any single conformational change. The ability to automatically process any protein simulation dataset to identify non-covalent interactions that are strongly associated with a single, defined conformational change would be a highly valuable tool for the community. Furthermore, the insights generated from this tool could be applied to basic research, in order to improve understanding of a mechanism of action, or for protein engineering, to identify candidate mutations to improve/alter the functionality of any given protein. The open-source Python package Key Interactions Finder (KIF) enables users to identify those non-covalent interactions that are strongly associated with any conformational change of interest for any protein simulated. KIF gives the user full control to define the conformational change of interest as either a continuous variable or categorical variable, and methods from statistics or machine learning can be applied to identify and rank the interactions and residues distributed throughout the protein, which are relevant to the conformational change. Finally, KIF has been applied to three diverse model systems (protein tyrosine phosphatase 1B, the PDZ3 domain, and the KE07 series of Kemp eliminases) in order to illustrate its power to identify key features that regulate functionally important conformational dynamics.
Published: 2023

26. Saturation mutagenesis charts the functional landscape of Salmonella ProQ and reveals a gene regulatory function of its C-terminal domain

Author: Rizvanovic, Alisa, Kjellin, Jonas, Söderbom, Fredrik, and Holmqvist, Erik
Subjects: Transcriptional Activation, Bioinformatics (Computational Biology), AcademicSubjects/SCI00010, High-Throughput Nucleotide Sequencing, RNA-Binding Proteins, Gene Expression Regulation, Bacterial, Adaptation, Physiological, Microbiology in the medical area, RNA, Bacterial, Amino Acid Substitution, Protein Domains, Salmonella, Bioinformatik (beräkningsbiologi), Mikrobiologi inom det medicinska området, RNA and RNA-protein complexes, Mutagenesis, Site-Directed, Amino Acid Sequence, RNA, Messenger
Abstract: The global RNA-binding protein ProQ has emerged as a central player in post-transcriptional regulatory networks in bacteria. While the N-terminal domain (NTD) of ProQ harbors the major RNA-binding activity, the role of the ProQ C-terminal domain (CTD) has remained unclear. Here, we have applied saturation mutagenesis coupled to phenotypic sorting and long-read sequencing to chart the regulatory capacity of Salmonella ProQ. Parallel monitoring of thousands of ProQ mutants allowed mapping of critical residues in both the NTD and the CTD, while the linker separating these domains was tolerant to mutations. Single amino acid substitutions in the NTD associated with abolished regulatory capacity strongly align with RNA-binding deficiency. An observed cellular instability of ProQ associated with mutations in the NTD suggests that interaction with RNA protects ProQ from degradation. Mutation of conserved CTD residues led to overstabilization of RNA targets and rendered ProQ inert in regulation, without affecting protein stability in vivo. Furthermore, ProQ lacking the CTD, although binding competent, failed to protect an mRNA target from degradation. Together, our data provide a comprehensive overview of residues important for ProQ-dependent regulation and reveal an essential role for the enigmatic ProQ CTD in gene regulation.
Published: 2021

27. A Robust and Precise ConvNet for Small Non-Coding RNA Classification (RPC-snRC)

Author: Sheraz Ahmed, Johan Trygg, Muhammad Nabeel Asim, Muhammad Imran Malik, Andreas Dengel, and Christoph Zehe
Subjects: Source code, General Computer Science, Computer science, Feature vector, Feature extraction, DenseNet, Machine learning, ResNet, 03 medical and health sciences, 0302 clinical medicine, Margin (machine learning), General Materials Science, Nucleotide, Protein secondary structure, 030304 developmental biology, 0303 health sciences, Bioinformatics (Computational Biology), RNA sequence analysis, Small non-coding RNA classification, General Engineering, RNA, Non-coding RNA, Support vector machine, Identification (information), 030220 oncology & carcinogenesis, Benchmark (computing), Bioinformatik (beräkningsbiologi), Artificial intelligence, lcsh:Electrical engineering. Electronics. Nuclear engineering, lcsh:TK1-9971
Abstract: Small non-coding RNAs (ncRNAs) are attracting increasing attention as they are now considered potentially valuable resources in the development of new drugs intended to cure several human diseases. A prerequisite for the development of drugs targeting ncRNAs or the related pathways is the identification and correct classification of such ncRNAs. State-of-the-art small ncRNA classification methodologies use secondary structural features as input. However, such feature extraction approaches only take global characteristics into account and completely ignore co-relative effects of local structures. Furthermore, secondary structure based approaches incorporate a high-dimensional feature space, which is computationally expensive. The present paper proposes a novel Robust and Precise ConvNet (RPC-snRC) methodology which classifies small ncRNAs into relevant families by utilizing their primary sequence. The RPC-snRC methodology learns a hierarchical representation of features by utilizing positioning and information on the occurrence of nucleotides. To avoid exploding and vanishing gradient problems, we use an approach similar to DenseNet in which the gradient can flow straight from subsequent layers to previous layers. In order to assess the effectiveness of deeper architectures for small ncRNA classification, we also adapted two ResNet architectures having a different number of layers. Experimental results on a benchmark small ncRNA dataset show that the proposed methodology not only outperforms existing small ncRNA classification approaches with a significant performance margin of 10% but also gives better results than the adapted ResNet architectures. To reproduce the results, the source code and data set are available at https://github.com/muas16/small-non-coding-RNA-classification.
Published: 2021

28. A time-causal and time-recursive scale-covariant scale-space representation of temporal signals and past time

Author: Tony Lindeberg
Subjects: The present, Time-recursive, General Computer Science, Signalbehandling, Wavelet analysis, Theoretical neuroscience, Temporal, Time-causal, Scale space, Time, Scale covariance, Memory, Signal, Delay, Bioinformatics (Computational Biology), Matematik, Scale, Time-frequency analysis, Theoretical biology, FOS: Biological sciences, Quantitative Biology - Neurons and Cognition, Signal Processing, Bioinformatik (beräkningsbiologi), Perceptual agent, Neurons and Cognition (q-bio.NC), Mathematics, Biotechnology
Abstract: This article presents an overview of a theory for performing temporal smoothing on temporal signals in such a way that: (i) temporally smoothed signals at coarser temporal scales are guaranteed to constitute simplifications of corresponding temporally smoothed signals at any finer temporal scale (including the original signal) and (ii) the temporal smoothing process is both time-causal and time-recursive, in the sense that it does not require access to future information and can be performed with no other temporal memory buffer of the past than the resulting smoothed temporal scale-space representations themselves. For specific subsets of parameter settings for the classes of linear and shift-invariant temporal smoothing operators that obey this property, it is shown how temporal scale covariance can be additionally obtained, guaranteeing that if the temporal input signal is rescaled by a uniform scaling factor, then also the resulting temporal scale-space representations of the rescaled temporal signal will constitute mere rescalings of the temporal scale-space representations of the original input signal, complemented by a shift along the temporal scale dimension. The resulting time-causal limit kernel that obeys this property constitutes a canonical temporal kernel for processing temporal signals in real-time scenarios when the regular Gaussian kernel cannot be used because of its non-causal access to information from the future and we cannot additionally require the temporal smoothing process to comprise a complementary memory of the past beyond the information contained in the temporal smoothing process itself, which in this way also serves as a multi-scale temporal memory of the past. This theory is generally applicable for both: (i) modelling continuous temporal phenomena over multiple temporal scales and (ii) digital processing of measured temporal signals in real time. 37 pages, 15 figures
Published: 2022
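
The time-causal and time-recursive smoothing described in the abstract of record 28 can be illustrated with a cascade of first-order recursive filters. This is a minimal sketch under stated assumptions, not the time-causal limit kernel of the article: the function names, the initialisation with the first sample, and the time constants passed as `mus` are all hypothetical. It only shows that each output sample is computed from the current input and the previous smoothed value at each level, so no future samples and no extra buffer of past input are needed.

```python
def recursive_smooth(signal, mu):
    """First-order recursive filter: out[t] = out[t-1] + (x[t] - out[t-1]) / (1 + mu)."""
    out = []
    # Assumption: the filter state starts at the first sample (0.0 for empty input).
    state = signal[0] if signal else 0.0
    for x in signal:
        # Only the current input and the previous state are used (time-causal, time-recursive).
        state = state + (x - state) / (1.0 + mu)
        out.append(state)
    return out


def temporal_scale_space(signal, mus):
    """Cascade the filter; level k holds the signal after k + 1 smoothing stages."""
    levels = []
    current = list(signal)
    for mu in mus:
        current = recursive_smooth(current, mu)
        levels.append(current)
    return levels


if __name__ == "__main__":
    step = [0.0] * 5 + [1.0] * 15
    # Hypothetical time constants, chosen only for illustration.
    for level in temporal_scale_space(step, mus=[0.5, 1.0, 2.0]):
        print([round(v, 3) for v in level])
```

Each additional stage yields a coarser temporal scale, and the smoothed values themselves act as the only memory of the past.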

29. Voxel-wise Longitudinal Analysis of Weight Gain from Different Dietary Fats using Image Registration-Based 'Imiomics' Analysis

Author: Andersson, Vendela
Subjects: Bioinformatics (Computational Biology), Imiomics, Bioinformatik (beräkningsbiologi), food and beverages, Radiology, Magnetic Resonance Imaging, Image registration
Abstract: There is an emerging global epidemic of obesity and related complications, such as type 2 diabetes (T2D). Alterations in body composition (adipose tissue, muscle volume and fat contents) are known to be associated with an increased metabolic risk. Understanding of the underlying mechanisms is key for development of novel intervention strategies. One study investigating the effect on body composition by different diets is Lipogain1. In this study, it was found that a small weight gain induced by polyunsaturated fats (PUFA, n=19) or saturated fats (SFA, n=20) had very different effects on body fat, liver fat and lean tissue mass respectively. The SFA group gained more liver fat and fat mass in general, while the PUFA group gained more muscle mass. These results were determined by magnetic resonance imaging. The goal of this project was to visualize the results from Lipogain1 by utilizing the novel technique Imiomics. Imiomics is a method for statistical analysis of whole-body medical images. By utilizing image registration, all images are transformed to a common reference space. This enables point-wise comparisons between all images included in the analysis. In this project, mean images of the alterations in fat content and local volume change of the two groups were created. These were used to visualize the alterations in body composition from the study. Additionally, statistical tests were used to visualize statistically significant differences between the groups. Differences between the groups could be seen in the mean images. Mainly a higher fat content increase was seen in SFA in comparison to PUFA. There was also a larger volume expansion in fat tissue in SFA than in PUFA, while PUFA instead had a larger volume expansion in muscles. An unexpected result was also found; the liver had expanded in PUFA but not in SFA. Unfortunately, few significant differences could be visualized between the groups when the statistical test was performed.
The conclusion was that this method is promising for visualization of these kinds of studies, especially due to the potential of finding new, unexpected results. However, a somewhat larger cohort and possibly larger alterations in body composition might be needed to be able to visualize and quantify statistically significant differences between the groups on a voxel-wise level.
Published: 2022

30. Hybrid variational autoencoder for analysis of single-cell RNA sequencing data

Author: Narrowe Danielsson, Sarah
Subjects: Variational Autoencoder, individuell cellanalys, Bioinformatics (Computational Biology), Computer and Information Sciences, Bioinformatics, Bioinformatik (beräkningsbiologi), Bioinformatik, Data- och informationsvetenskap, scRNAseq, Single-Cell Analysis
Abstract: Single-cell analysis means analyzing cells at an individual level. This individual analysis enhances the investigation of the heterogeneity among and the classification of individual cells. Single-cell analysis is a broad term and can include various measurements. This thesis utilizes single-cell RNA sequence data that measures RNA sequences representing genes for individual cells. This data is often high-dimensional, with tens of thousands of RNA sequences measured for each cell. Dimension reduction is therefore necessary when analyzing the data. One proposed dimension reduction method is the unsupervised machine learning method variational autoencoders. The scVI framework has previously implemented a variational autoencoder for analyzing single-cell RNA sequence data. The variational autoencoder of the scVI has one latent space with a Gaussian distribution. Several extensions have been made to the scVI framework since its creation. This thesis proposes an additional extension consisting of a variational autoencoder with two latent spaces, called hybridVI. One of these latent spaces has a Gaussian distribution and the other a von Mises-Fisher distribution. The data is separated between these two latent spaces, meaning that some of the genes go through one latent space and the rest go through the other. In this thesis the cell cycle genes go through the von Mises-Fisher latent space and the rest of the genes go through the Gaussian latent space. The motivation behind the von Mises-Fisher latent space is that cell cycle genes are believed to follow a circular distribution. Putting these genes through a von Mises-Fisher latent space instead of a Gaussian latent space could provide additional insights into the data. The main focus of this thesis was to analyze the impact of this separation. The analysis consisted of comparing the performance of the hybridVI model to the original scVI variational autoencoder.
The comparison utilized three annotated datasets: one peripheral blood mononuclear cell dataset, one cortex cell dataset, and one B cell dataset collected by the Henriksson lab at Umeå University. The evaluation metrics used were the adjusted Rand index and normalized mutual information, and a Wilcoxon signed-rank test was used to determine whether the results were statistically significant. The results indicate that the size of the dataset was essential for achieving robust and statistically significant results. For the two datasets that yielded statistically significant results, the scVI model performed better than the hybridVI model. However, more research analyzing biological aspects is necessary to declare the hybridVI model’s effect on the biological interpretation of the results. Single-cell analysis is a relatively new method that enables the study of cells at the individual level. This thesis analyzes RNA sequence data in which RNA sequences are specified for individual cells. This kind of data is often high-dimensional, with several thousand genes recorded for each cell. Analyzing such data requires some form of dimension reduction. One proposed method is the unsupervised machine learning method variational autoencoders. One framework, scVI, has produced a variational autoencoder designed to handle this kind of data. This model has only one latent space, with a normal distribution. This thesis proposes an extension of this framework with a variational autoencoder with two latent spaces, one of which is normally distributed while the other follows a von Mises-Fisher distribution. The motivation for such a distribution is that cell cycle genes are assumed to follow a circular distribution. The cell cycle genes in the data can thus be handled by the circular latent space. The main focus of this study is to investigate whether this separation of genes can improve the model's ability to find correct clusters.
The experiment was carried out on three annotated datasets: one consisting of peripheral blood mononuclear cells, one consisting of cortex cells, and one consisting of B cells collected by the Henriksson group at Umeå University. The model from the scVI framework was compared with the new method with two latent spaces, hybridVI. The metrics used to assess the models were the adjusted Rand index and normalized mutual information, and a Wilcoxon signed-rank test was used to assess the statistical significance of the results. The results show that both models perform better and more consistently on larger datasets. Two datasets gave statistically significant results and showed that the scVI model performed better than the hybrid model. However, a biological analysis of the results is needed to investigate which model's results have the most biological relevance.
Published: 2022

31. Analyzing Cell Painting images using different CNNs and Conformal Prediction variations : Optimization of a Deep Learning model to predict the MoA of different drugs

Author: Hillver, Anna
Subjects: Machine Learning, Bioinformatics (Computational Biology), Deep Learning, Industriell bioteknik, Artificial Intelligence, AI, Convolutional Neural Networks, Bioinformatik (beräkningsbiologi), Cell Painting, Conformal Prediction, Image Analysis, CNN, Industrial Biotechnology
Abstract: Microscopy imaging based techniques, such as the Cell Painting assay, could be used to generate images that visualize the Mechanism of Action (MoA) of a drug, which could be of great use in drug development. In order to extract information and predict the MoA of a new compound from these images we need powerful image analysis tools. The purpose of this project is to further develop a Deep Learning model to predict the MoA of different drugs from Cell Painting images using Convolutional Neural Networks (CNNs) and Conformal Prediction. The specific task was to compare the accuracy of different CNN architectures and to compare the efficiency of different nonconformity functions. During the project the CNN architectures ResNet50, ResNet101 and DenseNet121 were compared as well as the nonconformity functions Inverse Probability, Margin and a combination of them both. No significant difference in accuracy between the CNNs and no difference in efficiency between the nonconformity functions was measured. The results showed that the model could predict the MoA of a compound with high accuracy when all compounds were used in training, validation and testing of the model, which validates the implementations. However, it is desirable for the model to be able to predict the MoA of a new compound if the model has been trained on other compounds with the same MoA. This could not be confirmed through this project and the model needs to be further investigated and tested with another dataset in order to be used for that purpose.
Published: 2022
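
The two named nonconformity functions compared in record 31 can be sketched for a single vector of predicted class probabilities. This is an illustrative sketch using the commonly cited definitions of the inverse-probability and margin scores, not the thesis code. The function names, the calibration scores in the usage example, and the plain inductive p-value are assumptions made here, and the combined function mentioned in the abstract is not reproduced.

```python
def inverse_probability(probs, label):
    """Nonconformity = 1 - p(label): a low predicted probability is more nonconforming."""
    return 1.0 - probs[label]


def margin(probs, label):
    """Nonconformity = 0.5 - (p(label) - highest other probability) / 2, in [0, 1]."""
    others = [p for i, p in enumerate(probs) if i != label]
    return 0.5 - (probs[label] - max(others)) / 2.0


def p_value(calibration_scores, test_score):
    """Share of calibration scores at least as nonconforming as the test score."""
    n_ge = sum(1 for s in calibration_scores if s >= test_score)
    return (n_ge + 1) / (len(calibration_scores) + 1)


def prediction_set(probs, calibration_scores, score_fn, significance):
    """Keep every label whose p-value exceeds the chosen significance level."""
    return [
        label
        for label in range(len(probs))
        if p_value(calibration_scores, score_fn(probs, label)) > significance
    ]


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]                  # hypothetical CNN softmax output
    calibration = [0.05, 0.15, 0.35, 0.45]   # hypothetical calibration scores
    print(prediction_set(probs, calibration, inverse_probability, 0.3))
    print(prediction_set(probs, calibration, margin, 0.3))
```

Efficiency in conformal prediction is usually judged by how small such prediction sets are at a given significance level, which is the sense in which the nonconformity functions can be compared.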

32. Development and Application of Computational Models for Peptide-Protein Complexes

Author: Johansson-Åkhe, Isak
Subjects: Bioinformatics (Computational Biology), Bioinformatik (beräkningsbiologi)
Abstract: Protein-protein interactions between a protein and a smaller protein fragment or a disordered segment of a protein are called peptide-protein interactions. Such interactions are commonplace in nature and vital for normal cell function in humans. For example, the onco-protein Myc contains a large disordered region with several segments involved in peptide-protein interactions as part of transcription regulation, and it is mis-regulated in the vast majority of all human cancers. As such, understanding the structural details of peptide-protein interactions on an atomic level is a necessary endeavor for understanding disease pathways as well as facilitating targeted drug-design. While experimental methods for structure determination such as X-ray crystallography and NMR can determine the structure of many peptide-protein complexes, these methods are time-consuming and costly. Additionally, the disordered nature of peptides and a sometimes lower binding affinity than for protein-protein binding can lead to transient or weak (but still highly specific) interactions impossible to fully capture with experimental methods. This leads to the need for computational methods as support and complement. Such methods have classically used statistical potentials or simple template search approaches, but as the number of deposited structures in the protein databank (PDB) grows so does the potential for supervised machine learning. The papers in this thesis present the contributions of the author to the field of peptide-protein interaction complex prediction, mainly through use of machine learning models. The first papers apply a Random Forest classifier to detect similarities between binding interfaces deposited in the PDB and a peptide-protein pair being investigated to find the optimal templates for structure prediction.
In addition to producing predictions with good self-evaluation of performance, the development of the method also confirmed theories on the similarity of protein-protein, domain-domain, and peptide-protein interfaces. Two more methods for peptide-protein docking are presented in later papers. One utilizes graph convolution neural networks to improve model selection from rigid-body-docking methods by including MSA profile information as a feature, which also led to the discovery that while profile information such as position conservation does improve predictive performance, something also seen in the first papers, the most important features are the ones describing the structural details of the complex and the bonds between residues. The other uses a graph neural network as an additional scoring term to improve upon the already state-of-the-art performing local refinement method FlexPepDock, and is capable of refining even models generated by AlphaFold-multimer. Finally, two manuscripts focus on the application of computational approaches for research into the interactions of human cMyc with TBP and PPP1R10, respectively. In the first of these papers, the template-based peptide-protein complex prediction methods developed in the earlier papers of the thesis are employed together with prior knowledge of the interaction to model the complex to a high degree of certainty not achievable by NMR alone. In the second of these papers, experimental data is used as a basis for computational modeling of the complex, and the modeled complex could act as a basis for further experiments characterizing the interaction.
Published: 2022

33. Siamese Neural Networks for Regression: Similarity-Based Pairing and Uncertainty Quantification

Author: Zhang, Yumeng
Subjects: Bioinformatics (Computational Biology), Bioinformatik (beräkningsbiologi)
Abstract: Here we present a similarity-based pairing method for generating compound pairs to train a Siamese Neural Network. In comparison with the conventional exhaustive pairing of N²/2 pairs (N being the size of the training set), this method results in N-1 pairs, significantly reducing the training time. It exhibits a better prediction performance consistently on the three physicochemical property datasets, using a multilayer perceptron with the ECFP4 fingerprint. We further include into the Siamese Neural Network the pre-trained Chemformer which extracts task-specific chemical features from the input SMILES strings. With the n-shot learning, we propose a means to measure the prediction uncertainty. Our results demonstrate that the higher accuracy is indeed associated with the lower prediction uncertainty. In addition, we discuss implications of the similarity principle in machine learning.
Published: 2022
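
The difference in pair counts stated in the abstract of record 33 (roughly N²/2 for exhaustive pairing versus N-1 for similarity-based pairing) can be made concrete with a small sketch. The record does not describe the pairing rule, so the rule below is a hypothetical stand-in: each compound after the first is paired with its most similar earlier compound, which always yields N-1 pairs. Fingerprints are represented as sets of on-bits, and all names are chosen here for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def exhaustive_pairs(n):
    """All unordered pairs of n items: n * (n - 1) / 2 of them."""
    return list(combinations(range(n), 2))


def similarity_pairs(fingerprints):
    """Hypothetical rule: pair each item after the first with its most similar earlier item."""
    pairs = []
    for i in range(1, len(fingerprints)):
        best = max(range(i), key=lambda j: tanimoto(fingerprints[i], fingerprints[j]))
        pairs.append((best, i))
    return pairs


if __name__ == "__main__":
    # Toy fingerprints, not real ECFP4 bit vectors.
    fps = [{1, 2, 3}, {1, 2, 4}, {7, 8}, {7, 8, 9}]
    print(len(exhaustive_pairs(len(fps))), "exhaustive pairs")
    print(similarity_pairs(fps), "similarity-based pairs")
```

For a training set of 1,000 compounds the exhaustive scheme gives 499,500 pairs while any N-1 scheme gives 999, which is where the reduction in training time comes from.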

34. Comparison of Support Vector Machines and Deep Learning For QSAR with Conformal Prediction

Author: Deligianni, Maria
Subjects: conformal prediction, Bioinformatics (Computational Biology), machine learning, Bioinformatik (beräkningsbiologi), deep learning, qsar, support vector machines
Abstract: Quantitative Structure Activity Relationship (QSAR) is a very useful computational method which has facilitated great progress in drug development [1]. This method can be used to predict a molecule’s activity against a certain target just by comparing its structural characteristics (i.e., molecular descriptors) with those belonging to molecules of known activity. QSAR modeling is fueled by online free databases consisting of millions of active and inactive molecules and by Machine Learning (ML) methods that enable data analysis. To ensure successful implementation of ML models, there is a range of evaluation methods to estimate their performance and applicability domain. So far, a great deal of research has focused on the use of Support Vector Machines (SVMs) to classify molecules with the use of their Molecular Signature Fingerprints as descriptors [2]. However, another Machine Learning algorithm, Deep Neural Networks (DNNs), an improvement of single-layer Neural Networks, is rising in popularity in various fields including molecule classification. The two models were compared using CPSign software which introduces Conformal Prediction, to evaluate the reliability of model predictions based on performance for individual compounds rather than mean performance on a given test set. Three types of descriptors were used: Molecular Signature Fingerprints, Extended Connectivity Fingerprints and physicochemical descriptors. The comparison showed that Multilayer Perceptron (MLP), which was used as a DNN representative in the current context, had performance similar to the shallower SVM models but additionally demanded longer training times [3]. It can be concluded that in the field of QSAR with the aforementioned descriptors, when the number of examples used for training is not immense, Support Vector Machines might perform equally well and demand fewer resources and less time than the more sophisticated MLPs.
Published: 2022

35. Phylogenomics of Ascetosporea

Author: Bhawe, Harshal Kunal
Subjects: phylogenetics, Bioinformatics (Computational Biology), Bioinformatics and Systems Biology, mussel, Bioinformatik (beräkningsbiologi), ascetosporea, bioinformatics, Bioinformatik och systembiologi
Abstract: Ascetosporea is a class of poorly studied unicellular eukaryotes that function as parasites of marine invertebrates. These parasites cause mass mortality events in aquaculture species such as oysters and mussels. The economic importance of these aquaculture species should lead to more attention on the genomics of Ascetosporea and their place on the evolutionary tree of life. With the onset of global warming and rising sea levels and temperatures, many emerging pathogens have been seen and until these are sequenced and analysed, it is difficult to make any conclusions about their relationships and evolution. As there aren’t many genomes and transcriptomes available for Ascetosporea, their position in the larger eukaryotic tree of life remains hypothetical. To attempt to remedy this lack of information, the Burki lab has recently generated sequencing data through sample collection and sequencing for these organisms (genomes and transcriptomes). A curated dataset of the various eukaryotic species was previously created and newly sampled and sequenced Ascetosporean genomes of Paramarteilia sp., Marteilia pararefringens, Paramikrocytos canceri, etc. from multiple sampling locations like Ireland, Norway, Sweden, and the UK were included. These could increase the genomic and transcriptomic data available for Ascetosporea and help to resolve the relationships within Ascetosporea. A few reasons why this group has not yet been placed on the tree of life are that the samples are from host tissue, which makes it difficult to sequence these parasites. These Ascetosporeans have also been seen to be very fast-evolving. After building phylogenetic relationships with single gene trees to allow for the identification of possible contaminants and paralogs, it was seen that there was a lot of contamination in Ascetosporea, due to the sampling being from host tissue material (hosts are open to the environment). 
After cleaning and filtering the possibly contaminated genes, the trees were remade and a possible link between a fungal group called Microsporidia and Ascetosporea was observed in a few genes. This was hypothesized to be lateral gene transfer between the two groups resulting from their similar lifestyles and infection of invertebrates. Complications such as contamination and short BLAST hits arose during analysis, and these could be caused by fragmentation of the genome. This fragmentation could have negative effects on genome annotation predictions and consequently on phylogenetic and phylogenomic analysis. Due to this and the challenging nature of collecting samples, the read coverage for the genomes is low, but it can be used to perform phylogenetic and phylogenomic studies using currently available data and methods. Another expected result was that the sequenced data had contaminants, and a thorough and comprehensive search would have to be conducted on a dataset-wide level to remove any contaminants.
Published: 2022

36. Generative Modelling and Probabilistic Inference of Growth Patterns of Individual Microbes

Author: Nagarajan, Shashi
Subjects: Nuisance Variable Elimination, Datorsystem, Bioinformatics (Computational Biology), Computer Systems, Bioinformatik (beräkningsbiologi), microfluidic single cell cultivations, probabilistic inference, Sannolikhetsteori och statistik, Variational Inference, Probability Theory and Statistics, Generative modelling
Abstract: The fundamental question of how cells maintain their characteristic size remains open. Cell size measurements made through microscopic time-lapse imaging of microfluidic single cell cultivations have posed serious challenges to classical cell growth models and are supporting the development of newer, nuanced models that explain empirical findings better. Yet current models are limited, either to specific types of cells and/or to cell growth under specific microenvironmental conditions. Together with the fact that tools for robust analysis of said time-lapse images are not widely available as yet, the above-mentioned point presents an opportunity to progress the cell growth and size homeostasis discourse through generative, probabilistic modeling and analysis of the utility of different statistical estimation and inference techniques in recovering the parameters of the same. In this thesis, I present a novel Model Framework for simulating microfluidic single-cell cultivations with 36 different simulation modalities, each integrating dominant cell growth theories and generative modelling techniques. I also present a comparative analysis of how different Frequentist and Bayesian probabilistic inference techniques such as Nuisance Variable Elimination and Variational Inference work in the context of a case study of the estimation of a single model describing a microfluidic cell cultivation.
Published: 2022

37. Applying positivity constraints to q-space trajectory imaging : The QTI plus implementation

Author: Deneb Boito, Magnus Herberthson, Tom Dela Haije, and Evren Özarslan
Subjects: Bioinformatics (Computational Biology), Bioinformatik (beräkningsbiologi), Diffusion MRI, QTI, Constrained estimation, Microscopic anisotropy, Software, Computer Science Applications
Abstract: Diffusion MRI is a powerful technique sensitive to the microstructure of heterogeneous media. By relating the dMRI signal obtained via general gradient waveforms to the moments of an underlying diffusion tensor distribution, q-space trajectory imaging (QTI) provides several quantities indicative of the structural composition of the medium. Substantial improvements in the reliability of the produced estimates have been achieved by incorporating necessary positivity constraints in the estimation through Semidefinite Programming. Here we present the Matlab code implementing said constraints, provide a simple example showing the main functionalities of the package, and point to resources within the package that can be used to reproduce results recently published with this software. The block-based structure of our implementation allows the selection of steps to be performed, and facilitates the incorporation of new constraints in future releases. Funding Agencies|Linkoping University Center for Industrial Information Technology (CENIIT), Sweden [VINNOVA/ITEA3 17021 IMPACT]; Analytic Imaging Diagnostic Arena (AIDA); Swedish Foundation for Strategic Research [RMX18-0056]; VILLUM FONDEN, Denmark [00028384]
Published: 2022
Full Text: View/download PDF

38. Metagenomic analysis of Crohn’s Disease

Author: Lennemyr Ahlström, Gustav
Subjects: Data Analysis, Crohn's Sjukdom, Bioinformatics (Computational Biology), Bioinformatics, IBD, MetaPhlAn3, Bioinformatik, Crohn's Disease, Shotgun Sequencing Data, Supervised ML, Machine Learning, AI, Bioinformatik (beräkningsbiologi), Mikrobiom, Metagenomics, Microbiome, Human Health
Abstract: Inflammatory Bowel Disease (IBD) is a chronic and incurable condition that is increasing in prevalence across the globe. This illness consists of two forms: Crohn’s Disease (CD) and Ulcerative Colitis (UC). CD is characterised by a patchy inflammation pattern across the gut and a multitude of different factors, such as diet. Contemporary research has found a link between gut dysbiosis and the development of IBD, suggesting that the microbial flora colonising the gut have a vital part to play in the development of CD. This paper aims to identify taxa associated with CD. This is done through the application of machine learning algorithms, as standard univariate statistical methods fail to apply in the highly interdependent domain of the gut microbiome. The compositionality of the data and external factors influencing variance in the data will be taken into account. After applying a centered log-ratio (CLR) transformation to a MetaPhlAn3 taxonomic profile and using a random forest classifier, the following five taxa were identified as the most important in the association to CD: Ruminococcaceae bacterium, Akkermansia muciniphila, Streptococcus parasanguinis, Flavonifractor plautii and Bifidobacterium bifidum.
Published: 2022
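
Illustrative note for record 38: the abstract applies a centered log-ratio (CLR) transformation to a taxonomic profile before classification. The sketch below shows only that transformation for a single sample, in plain Python. The function name `clr`, the pseudocount value and the example abundances are assumptions made for illustration; the abstract does not say how zero abundances were handled or which implementation was used.

```python
import math


def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's relative abundances.

    The pseudocount is an assumption: zeros have no logarithm, so a small
    constant is added to every part before taking logs.
    """
    shifted = [a + pseudocount for a in abundances]
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    # Each part is expressed relative to the geometric mean of the sample.
    return [value - mean_log for value in logs]


# Hypothetical relative abundances of four taxa in one sample.
sample = [0.5, 0.3, 0.2, 0.0]
transformed = clr(sample)
```

The CLR values of a sample sum to zero and depend only on ratios between taxa, which is why the transformation is commonly used for compositional data before a classifier is trained.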

39. Insight into the evolution of the genus Mycobacterium

Author: Behra, Phani Rama Krishna
Subjects: Evolutionsbiologi, Evolutionary Biology, Bioinformatics (Computational Biology), Mycobacterial genomes, tRNA and non-coding RNA, Cell- och molekylärbiologi, Bioinformatik (beräkningsbiologi), core gene phylogeny, Cell and Molecular Biology
Abstract: The genus Mycobacterium includes more than 190 species, and many cause severe diseases such as tuberculosis and leprosy. According to the "World Health Organization", in year 2019 alone, 10 million people developed TB, and 1.4 million died. TB had been in decline in developed countries, but made its reappearance as an opportunistic pathogen targeting immuno-compromised AIDS victims. Also, non-tuberculosis mycobacteria (NTM) infections have emerged as a major infectious agent in recent times. NTM occupy diverse ecological niches and can be isolated from soil, tap water, and groundwater. This thesis has investigated the Mycobacterium species from a genomic perspective, focusing on the biology of virulence factors, mobile genetic elements, tRNAs, and non-coding RNAs and their evolutionary distribution and possible relationship with phenotypic diversity. As part of this study, we have sequenced 153 mycobacterial genomes, including type strains, environmental samples, isolates from hospital patients, infected fish, and outbreak samples in an animal facility at Uppsala University. We have provided a phylogenetic tree based on 387 (and 56) core genes covering most species (244 genomes) constituting the Mycobacterium genus. The core gene phylogeny resulted in 33 clades. Subsequently, we have covered different clade groups, such as, M. marinum, M. mucogenicum, M. chelonae and M. chlorophenolicum and investigated the NTM clade-specific genome diversity and evolution. Our examination of non-coding genes showed that the total number of tRNA genes per species varies between 42 and 90. Among the species with more than 50 tRNAs, additional tRNA genes are likely acquired through horizontal gene transfer (HGT), as supported by the presence of closely linked HNH endonuclease gene and GOLLD RNA. We have explored the presence of selenocysteine utility and the gene for selenoprotein "formate dehydrogenase" among 244 mycobacterial genomes. For the M. 
chlorophenolicum clade, we have explored genes with a role in the bioremediation process. Comparative genomics of the M. marinum and M. chelonae clade groups suggests new clusters or subspecies. Mutational hotspots are relatively more frequent in M. marinum than in M. tuberculosis and M. salmoniphilum. The relatively higher number of hotspots in M. marinum is likely related to its ability to occupy different ecological niches. Finally, the thesis uncovered IS elements, phage sequences, plasmids, tRNA, and ncRNA contributing to mycobacterial evolution.
Published: 2022

40. Identification of dynamic mass-action biochemical reaction networks using sparse Bayesian methods

Author: Richard Jiang, Prashant Singh, Fredrik Wrede, Andreas Hellander, Linda Petzold, and Mendes, Pedro
Subjects: Bioinformatics (Computational Biology), Ecology, QH301-705.5, Biochemical Phenomena, Bioinformatics, Beräkningsmatematik, Systems Biology, Uncertainty, Bayes Theorem, Biological Sciences, Mathematical Sciences, Cellular and Molecular Neuroscience, Computational Mathematics, Computational Theory and Mathematics, Modeling and Simulation, Information and Computing Sciences, Genetics, Bioinformatik (beräkningsbiologi), Generic health relevance, Biology (General), Molecular Biology, Ecology, Evolution, Behavior and Systematics
Abstract: Identifying the reactions that govern a dynamical biological system is a crucial but challenging task in systems biology. In this work, we present a data-driven method to infer the underlying biochemical reaction system governing a set of observed species concentrations over time. We formulate the problem as a regression over a large, but limited, mass-action constrained reaction space and utilize sparse Bayesian inference via the regularized horseshoe prior to produce robust, interpretable biochemical reaction networks, along with uncertainty estimates of parameters. The resulting systems of chemical reactions and posteriors inform the biologist of potentially several reaction systems that can be further investigated. We demonstrate the method on two examples of recovering the dynamics of an unknown reaction system, to illustrate the benefits of improved accuracy and information obtained. eSSENCE - An eScience Collaboration
Published: 2022

41. Pathway analysis: methods and perspectives

Author: Stolf Jeuken, Gustavo
Subjects: Bioinformatics (Computational Biology), transcriptomics, Bioinformatik (beräkningsbiologi), mutual information, enrichment analysis, pathway analysis, survival analysis
Abstract: The amount of data being generated by high throughput molecular biology experiments grows every day, both in quantity and quality. With this comes the desire to have more powerful and comprehensive methods for statistical analysis that have been developed with the nature of this data in mind. One of the lines of research that has been developed with this specific goal in mind is pathway analysis. Here, pathways are units of information that have been curated in a way that makes biological knowledge of cellular processes available in a programmatic way, and pathway analysis methods make use of this information to help understand the results of high throughput experiments. This is an exploratory thesis on the field of pathway analysis. I give a brief introduction to the field, what motivated its development, the problems it tries to solve, and some of the proposed statistical methods, together with some discussion on the implications of this type of analysis. I then present three original works on pathway analysis, each with a different perspective on the task. First, we present a more reliable null model for pathway analysis methods that use functional association networks, which results in better-calibrated statistics. Second, we show how we can combine pathway analysis methods with other statistical methods, such as survival analysis. We applied this method to a large breast cancer cohort and show that in this case pathways provide better prognostic power than individual genes. Third, we leverage concepts from information theory to design an original pathway analysis method that is very sensitive and flexible, while being practically without parameters. Together, all three papers contribute to furthering the field's usefulness and to the understanding of this type of analysis. The amount of data generated in large-scale molecular biology experiments is increasing steadily, in both quantity and quality.
As a consequence, the need for more powerful and more comprehensive methods for the interpretation and statistical analysis of such data is growing. One research methodology that attempts to solve problems associated with the statistical analysis of large, mixed biological datasets is pathway analysis (a pathway being a path or sequence of steps). A biological or biomedical pathway is a unit of annotated information that has been curated so that it represents prior biological knowledge. The programmatically available information about plausible connections in the large dataset can include metabolic processes, cellular localization or biochemical function. The large number of pathways then enables systematic data integration and a better understanding of large datasets from high-throughput experiments. In this thesis we describe pathway analysis by first giving a brief introduction to the techniques, what motivated their development, the problems pathway analysis tries to solve and some of the proposed statistical methods, together with some discussion of the implications of this type of analysis. I then present three publications on pathway analysis, each with a different perspective on the task. First, we present a more reliable, graph-based statistical null model for pathway analysis methods that build on functional association networks, which results in better-calibrated statistics. In the second paper, we show how pathway analysis methodology can be combined with other statistical methods, such as survival analysis. We applied this method to a large breast cancer cohort and show that in this case pathways give better prognostic power than individual genes. In the third paper, we use concepts from information theory to design an improved pathway analysis methodology that is very sensitive and flexible, while being practically parameter-free.
Together, all three papers contribute to increasing the field's usefulness and the understanding of this type of analysis. QC 2022-08-25
Published: 2022

42. Computational statistical methods for genotyping biallelic DNA markers from pooled experiments

Author: Clouard, Camille
Subjects: Bioinformatics (Computational Biology), Bioinformatik (beräkningsbiologi), Other Computer and Information Science, Annan data- och informationsvetenskap
Abstract: The information conveyed by genetic markers such as Single Nucleotide Polymorphisms (SNPs) has been widely used in biomedical research for studying human diseases, but also increasingly in agriculture by plant and animal breeders for selection purposes. Specific identified markers can act as a genetic signature that is correlated to certain characteristics in a living organism, e.g. a sensitivity to a disease or high-yield traits. Capturing these signatures with sufficient statistical power often requires large volumes of data, with thousands of samples to analyze and possibly millions of genetic markers to screen. Establishing statistical significance for effects from genetic variations is especially delicate when they occur at low frequencies. The production cost of such marker genotype data is therefore a critical part of the analysis. Despite recent technological advances, the production cost can still be prohibitive and genotype imputation strategies have been developed for addressing this issue. The genotype imputation methods have been widely investigated on human data and to a smaller extent on crop and animal species. In the case where only few reference genomes are available for imputation purposes, such as for non-model organisms, the imputation results can be less accurate. Group testing strategies, also called pooling strategies, can be well-suited for complementing imputation in large populations and decreasing the number of genotyping tests required compared to the single testing of every individual. Pooling is especially efficient for genotyping the low-frequency variants. However, because of the particular nature of genotype data and because of the limitations inherent to the genotype testing techniques, decoding pooled genotypes into unique data resolutions is a challenge. Overall, the decoding problem with pooled genotypes can be described as an inference problem in Missing Not At Random data with nonmonotone missingness patterns.
Specific inference methods such as variations of the Expectation-Maximization algorithm can be used for resolving the pooled data into estimates of the genotype probabilities for every individual. However, the non-randomness of the undecoded data impacts the outcomes of the inference process. This impact is propagated to imputation if the inferred genotype probabilities are used as input to classical imputation methods for genotypes. In this work, we propose a study of the specific characteristics of a pooling scheme on genotype data, as well as how it affects the results of imputation methods such as tree-based haplotype clustering or coalescent models.
Published: 2022

43. Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data

Author: Fredrik Svensson, Johannes Kirchmair, Andrea Volkamer, Miriam Mathea, Andrea Morger, Marina Garcia de Lomana, and Ulf Norinder
Subjects: Machine Learning, Bioinformatics (Computational Biology), Multidisciplinary, Chemical toxicity, Drug discovery, Environmental chemistry, Calibration, Molecular Conformation, Bioinformatik (beräkningsbiologi), Environmental science, Biological Assay, Computational biology and bioinformatics
Abstract: Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
Published: 2022
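
Illustrative note for record 43: the abstract describes restoring the validity of conformal prediction (CP) models by updating the calibration set without retraining the model. The sketch below shows, in plain Python, how a class-conditional conformal p-value depends only on the calibration scores, so that swapping the calibration set changes the prediction set while the underlying model stays fixed. The nonconformity scores, class labels, significance level and function names are all hypothetical, and the class-conditional (Mondrian) setup is an assumption rather than a detail taken from the abstract.

```python
def p_value(calibration_scores, test_score):
    """Conformal p-value: share of calibration examples at least as
    nonconforming as the test example, with the usual +1 correction."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration_by_class, test_scores_by_class, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {
        label
        for label, score in test_scores_by_class.items()
        if p_value(calibration_by_class[label], score) > significance
    }


# Hypothetical nonconformity scores produced by one fixed, already trained model.
original_calibration = {
    "toxic": [0.1, 0.2, 0.3, 0.4],
    "nontoxic": [0.2, 0.5, 0.6, 0.9],
}
# Hypothetical updated calibration set drawn from data closer to the test set.
updated_calibration = {
    "toxic": [0.1, 0.2, 0.3, 0.4],
    "nontoxic": [0.8, 0.9, 1.0, 1.1],
}
test_scores = {"toxic": 0.35, "nontoxic": 0.95}

before = prediction_set(original_calibration, test_scores, significance=0.25)
after = prediction_set(updated_calibration, test_scores, significance=0.25)
```

In this toy example the prediction set grows from one label to both labels after recalibration, which mirrors the trade-off stated in the abstract: validity can be restored, but more predictions become inconclusive.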

44. Bioinformatik, evolution och revolution för vår förståelse av toxin-antitoxinsystem

Author: Saha, Chayan Kumar
Subjects: Evolutionsbiologi, Bioinformatics (Computational Biology), Evolutionary Biology, Bioinformatics, Evolution, Bioinformatik (beräkningsbiologi), Toxin, Gene neighbourhood, Antitoxin, Alarmones
Abstract: Bacteria experience a wide range of natural challenges during their life cycles, to which they must respond and adapt to live. Under stressed conditions such as amino acid starvation, bacteria slow down their growth mechanism by producing small alarmone nucleotides guanosine pentaphosphate (pppGpp) and tetraphosphate (ppGpp), collectively referred to as (p)ppGpp. Accumulation of (p)ppGpp results in a comprehensive alteration in cellular metabolism. The alarmone (p)ppGpp is produced and degraded by enzymes belonging to the RelA/SpoT Homologue (RSH) protein family, named for their sequence similarity to the RelA and SpoT enzymes of Escherichia coli. The members of the RSH protein family can be classified into long multi-domain RSHs and short single-domain RSHs. Long RSH enzymes such as RelA and SpoT, carry both the (p)ppGpp hydrolysis (HD) domain and (p)ppGpp synthesis (SYNTH) domain in the N-terminal domain enzymatic region (NTD), in combination with additional domains in the C-terminal domain regulatory region (CTD). Short single-domain RSHs can be divided into small alarmone hydrolases (SAHs) carrying the HD domain, and small alarmone synthetases (SASs) that only have the SYNTH domain. At the beginning of my PhD, I studied the diversity of RSH proteins across the tree of life. I identified 35,615 RSH proteins from analyses of 24,072 genomes. I used large-scale phylogenetic analyses to classify the RSH proteins into 13 long RSHs, 11 SAHs and 30 SASs subfamilies. To address why bacteria often carry multiple SASs in the same genome and predict new functions, I developed a computational tool called FlaGs – standing for Flanking Genes – to analyse the conservation of genomic neighbourhoods in large datasets. I also developed a web-based version of FlaGs, called webFlaGs, that is publicly accessible and is used by biologists all over the world. 
The application of FlaGs to SAS RSHs led to the discovery that multiple SAS subfamilies are encoded in conserved and frequently overlapping two or three-gene architecture, reminiscent of toxin−antitoxin (TA) systems. Five SAS representatives from the FaRel, FaRel2, PhRel, PhRel2 and CapRel subfamilies were experimentally validated as toxins (toxSASs) that are neutralised by the products of six neighbouring antitoxin genes. The toxSAS enzyme FaRel from Cellulomonas marina is encoded as the central gene of a conserved three-gene architecture and acts through the production of nucleotide alarmone ppGpp and its unusual toxic analogue ppApp, causing a significant depletion of ATP and GTP. FaRel toxicity can be countered by both downstream and upstream cognate antitoxins. The latter contains a SAH domain that neutralises toxicity through degradation of ppGpp as well as ppApp. Combining phylogenetic and FlaGs analyses we have discovered that the DUF4065 domain of unknown function is a widely distributed antitoxin domain in putative TA-like operons with dozens of distinct toxic domains. Nine DUF4065-containing antitoxins and their cognate toxins were experimentally validated as TA pairs using toxicity neutralisation assays. These antitoxins form complexes with their diverse cognate toxins. Given the versatility of DUF4065, we have renamed this domain Panacea. We hypothesise that there are multiple hyperpromiscuous antitoxins like Panacea that can be associated with many non-homologous toxin domains, which may also be hyperpromiscuous. Thus, TA systems across all bacteria can be represented as a network of toxin and antitoxin domain combinations, with hyperpromiscuous domains being hubs in the network. 
To test this and compute a network, I developed a new, iterative version of FlaGs, called NetFlax (standing for Network-FlaGs for toxins and antitoxins), that can identify TA-like gene architectures in an unsupervised manner and generate a toxin-antitoxin domain interaction network. The results from NetFlax verify our strategy since we have rediscovered multiple previously characterised TAs as well as brand new ones. We find that Panacea is unusual but not unique in its hyperpromiscuity and our network indicates the presence of novel hyperpromiscuous domains still to be explored. Our findings also demonstrate how a core network of TAs is evolutionarily linked to various accessory genome systems, including conjugative transfer and phage defence mechanisms. The existing network can potentially be a framework on which future discoveries of the biological role of TAs can be mapped. Number in series missing in publication.
Published: 2022

45. Genomisk adaption och gendosreglering i bananflugeceller och utveckling av long-read-programvara

Author: Lewerentz, Jacob
Subjects: Nanopore, Bioinformatics (Computational Biology), software, Cellbiologi, cell line, bioinformatics, Cell Biology, genome evolution, long read, structural variant, Illumina, dosage compensation, Bioinformatik (beräkningsbiologi), cancer, Drosophila, Pacbio
Abstract: Cells are the vehicles that allow genetic code to proliferate in the world, taking on various forms, as illustrated by the tree of life. The cell features are determined by the manufacturing of proteins, a process that has its blueprints encoded as genes in the genome. It is crucial for all cells to have the right amount of protein, regardless of context (part of a multicellular organism or self-sustained). The protein landscape (amount and type) varies depending on the environment. Cells of a multicellular organism should maintain the protein balance to provide their intended function in the organism's tissue. The cells of multicellular organisms are faced with an imbalance due to sex-related chromosomal imbalances and other genome effects that change the number of gene copies. Restoration from the imbalance is done by dosage compensation systems. Cells that are isolated from the organism and grown in the lab, known as cell lines, are common in research. Cancer cells are similar to cell lines and have lost their original function in the organism in favor of a self-sustained lifestyle. The new environment (context) for these isolated cells imposes a challenge; the cells must reorganize their genomes (holding the blueprints for proteins) to obtain autonomy. In this thesis, the genome evolution of isolated cells, cell lines, has been studied using Drosophila melanogaster (the fruit fly). Compared to normal cells of the host organism, cell line genomes are highly mutated and rearranged. With the emergence of novel sequencing technologies that can read long fragments of the genome, this complexity of cell line genomes can be captured. On the topic of novel sequencing technologies, new software implementations are presented and the future of software for long reads and complex genomes is discussed. The main focus of this thesis is to describe how an established and commonly used cell line has reorganized its genome to sustain a culture environment.
Important information about the genome structure is provided to the research community. The thesis also describes the genome reorganization in new cell lines, covering the early adaptations to cell autonomy. Together, these investigations are of high relevance to human cancer research. This thesis has also studied the fundamentals for regulation of protein balance in organismal cells. Specifically, it examines a recognition sequence to the X chromosome of the protein Painting of Fourth. This protein is related to dosage compensation and primarily enhances transcription from the 4th chromosome in Drosophila melanogaster, but has been observed to occasionally bind to the X chromosome.
Published: 2022

46. Structure-based Virtual Screening for Ligands of G Protein-coupled Receptors : Design of Allosteric and Dual-Target Modulators

Author: Kampen, Stefanie
Subjects: Parkinson’s Disease, Bioinformatics (Computational Biology), Virtual Screening, G protein-coupled receptors, Polypharmacology, Allosteric Modulators, Bioinformatik (beräkningsbiologi), Structure-based Drug Design, Molecular Docking
Abstract: G protein-coupled receptors (GPCRs) are integral membrane proteins responsible for signal transduction of extracellular stimuli into the cell. Because of their widespread distribution throughout the human body and important roles in physiological processes, GPCRs are prominent drug targets and approximately 34% of all approved drugs interact with members of this superfamily. GPCR ligands are used as drugs against various diseases, including neurodegenerative and neuropsychiatric disorders. The increased availability of GPCR structural information has enhanced understanding of GPCR function but also enables structure-based drug design (SBDD). This thesis focuses on SBDD targeting allosteric and orthosteric binding sites of GPCRs and strategies to identify multi-target ligands. Drug discovery campaigns are traditionally based on the one-target-one-drug paradigm, but effective treatment of complex neurological disorders generally requires modulation of several signaling pathways. In publication I, dual-target ligands that activate the D2 dopamine receptor (D2R) and antagonize the A2A adenosine receptor (A2AAR) were designed through a structure-based approach. Both GPCRs are relevant for Parkinson’s disease (PD) and animal studies support that interactions with these targets induce neuroprotection while eliciting a synergistic therapeutic effect. One of the designed ligands was shown to yield an antiparkinsonian effect in a rodent model. Publication II focuses on the identification of negative allosteric modulators (NAMs) of the metabotropic glutamate receptor 5 (mGlu5) using structure-based virtual screening. Such modulators have been considered as a treatment of PD, fragile X syndrome and depression. The study discovered 11 allosteric modulators and four of these were also shown to be NAMs of mGlu5. Manuscript III describes the development of dual-target ligands acting as antagonists of the A2AAR and NAMs of mGlu5. 
Blocking the activity of both receptors has been shown to have a synergistic antiparkinsonian effect that could be both symptomatic and neuroprotective. In this study, virtual screening was used to discover drug-like compounds with submicromolar binding affinity to both targets. Publication IV presents a comprehensive review of SBDD targeting GPCRs of all classes with a specific focus on the method of molecular docking. Publication V describes a program for automatic validation of X-ray crystal structures. Possible applications involve assessment of protein structures used in SBDD or the generation of high-quality test sets for the evaluation of molecular docking methods. The results of this thesis illustrate that structure-based virtual screening is a versatile tool to discover ligands with tailored pharmacological properties.
Published: 2022

47. A time-causal and time-recursive scale-covariant scale-space representation of temporal signals and past time

Author: Lindeberg, Tony
Subjects: Signal, The present, Delay, Bioinformatics (Computational Biology), Matematik, Time-recursive, Wavelet analysis, Theoretical neuroscience, Temporal, Time-causal, Scale space, Time, Scale, Time-frequency analysis, Scale covariance, Memory, Datorseende och robotik (autonoma system), Theoretical biology, Bioinformatik (beräkningsbiologi), Perceptual agent, Mathematics, Computer Vision and Robotics (Autonomous Systems)
Abstract: This article presents an overview of a theory for performing temporal smoothing on temporal signals in such a way that: (i) temporally smoothed signals at coarser temporal scales are guaranteed to constitute simplifications of corresponding temporally smoothed signals at any finer temporal scale (including the original signal) and (ii) the temporal smoothing process is both time-causal and time-recursive, in the sense that it does not require access to future information and can be performed with no other temporal memory buffer of the past than the resulting smoothed temporal scale-space representations themselves. For specific subsets of parameter settings for the classes of linear and shift-invariant temporal smoothing operators that obey this property, it is shown how temporal scale covariance can be additionally obtained, guaranteeing that if the temporal input signal is rescaled by a uniform temporal scaling factor, then also the resulting temporal scale-space representations of the rescaled temporal signal will constitute mere rescalings of the temporal scale-space representations of the original input signal, complemented by a shift along the temporal scale dimension. The resulting time-causal limit kernel that obeys this property constitutes a canonical temporal kernel for processing temporal signals in real-time scenarios when the regular Gaussian kernel cannot be used, because of its non-causal access to information from the future, and we cannot additionally require the temporal smoothing process to comprise a complementary memory of the past beyond the information contained in the temporal smoothing process itself, which in this way also serves as a multi-scale temporal memory of the past. We describe how the time-causal limit kernel relates to previously used temporal models, such as Koenderink's scale-time kernels and the ex-Gaussian kernel. 
We also give an overview of how the time-causal limit kernel can be used for modelling the temporal processing in models for spatio-temporal and spectro-temporal receptive fields, and how it more generally has a high potential for modelling neural temporal response functions in a purely time-causal and time-recursive way, that can also handle phenomena at multiple temporal scales in a theoretically well-founded manner. We detail how this theory can be efficiently implemented for discrete data, in terms of a set of recursive filters coupled in cascade. Hence, the theory is generally applicable for both: (i) modelling continuous temporal phenomena over multiple temporal scales and (ii) digital processing of measured temporal signals in real time. We conclude by stating implications of the theory for modelling temporal phenomena in biological, perceptual, neural and memory processes by mathematical models, as well as implications regarding the philosophy of time and perceptual agents. Specifically, we propose that for A-type theories of time, as well as for perceptual agents, the notion of a non-infinitesimal inner temporal scale of the temporal receptive fields has to be included in representations of the present, where the inherent non-zero temporal delay of such time-causal receptive fields implies a need for incorporating predictions from the actual time-delayed present in the layers of a perceptual hierarchy, to make it possible for a representation of the perceptual present to constitute a representation of the environment with timing properties closer to the actual present. QC 20220926 Scale-space theory for covariant and invariant visual perception
Published: 2022
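
Note on record 47: the abstract mentions a discrete implementation as "a set of recursive filters coupled in cascade". The following is a minimal Python sketch of that idea, not the author's code. It assumes first-order recursive filters of the form out(t) = out(t-1) + (in(t) - out(t-1)) / (1 + mu), geometrically spaced temporal variance levels, and a truncation to K levels; the function names and the parameters `tau_max`, `c` and `K` are illustrative choices that do not appear in the record.

```python
import math


def time_constants(tau_max, c=2.0, K=4):
    """Time constants mu_k for K first-order filters (illustrative).

    Assumption: temporal variance levels are geometrically spaced,
    tau_k = c**(2 * (k - K)) * tau_max for k = 1..K.
    """
    taus = [c ** (2 * (k - K)) * tau_max for k in range(1, K + 1)]
    mus = []
    previous = 0.0
    for tau in taus:
        delta = tau - previous
        # A discrete first-order filter with time constant mu adds
        # variance mu**2 + mu; solve mu**2 + mu = delta for mu >= 0.
        mus.append((math.sqrt(1.0 + 4.0 * delta) - 1.0) / 2.0)
        previous = tau
    return mus


def smooth(signal, mus):
    """Run the signal through the cascade, one sample at a time.

    The only memory kept between samples is one state value per filter,
    which is what makes the scheme time-recursive. No future samples
    are accessed, so it is also time-causal.
    """
    state = [0.0] * len(mus)
    output = []
    for sample in signal:
        value = sample
        for k, mu in enumerate(mus):
            state[k] += (value - state[k]) / (1.0 + mu)
            value = state[k]  # output of filter k feeds filter k + 1
        output.append(value)
    return output


if __name__ == "__main__":
    mus = time_constants(tau_max=16.0)
    impulse = [1.0] + [0.0] * 30
    print([round(v, 4) for v in smooth(impulse, mus)])
```

Each filter has unit DC gain, so the impulse response of the cascade sums to one, and the variances of the individual filters add up to `tau_max`.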

48. Enhanced Aquila optimizer algorithm for global optimization and constrained engineering problems

Author: Huangjing Yu, Heming Jia, Jianping Zhou, and Abdelazim G. Hussien
Subjects: Computational Mathematics, Bioinformatics (Computational Biology), Aquila optimizer, AO, restart strategy, opposition-based, chaotic local search, Applied Mathematics, Modeling and Simulation, Bioinformatik (beräkningsbiologi), General Medicine, General Agricultural and Biological Sciences
Abstract: The Aquila optimizer (AO) is a recently developed swarm algorithm that simulates the hunting behavior of Aquila birds. In complex optimization problems, AO may converge slowly or fall into sub-optimal regions, especially in highly complex ones. This paper tries to overcome these problems by using three different strategies: a restart strategy, opposition-based learning and chaotic local search. The developed algorithm, named mAO, was tested using 29 CEC 2017 functions and five different constrained engineering problems. The results prove the superiority and efficiency of mAO in solving many optimization issues. Funding agencies: Educational research projects of young and middle-aged teachers in Fujian Province [JAT200648]; Fujian Natural Science Foundation Project [2021J011128]; Digital Fujian Research Institute for Industrial Energy Big Data; Fujian Province University Key Lab for Industry Big Data Analysis and Application; Fujian Key Lab of Agriculture IOT Application; IOT Application Engineering Research Center of Fujian Province Colleges and Universities; Sanming City 5G Innovation Laboratory
Published: 2022
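
Note on record 48: two of the strategies named in the abstract, opposition-based learning and chaotic local search, can be illustrated generically. The sketch below is not the mAO algorithm from the paper: it assumes minimization, box bounds, the common opposite-point definition `lower + upper - x`, and a logistic map as the chaotic sequence. All function names and the parameters `steps`, `radius` and `z` are hypothetical, and the restart strategy is not covered.

```python
def opposite(x, lower, upper):
    """Opposite point of x inside the box [lower, upper]."""
    return [lo + hi - xi for xi, lo, hi in zip(x, lower, upper)]


def obl_step(population, fitness, lower, upper):
    """Opposition-based learning: keep the better of each candidate
    and its opposite point (minimization)."""
    result = []
    for x in population:
        x_opp = opposite(x, lower, upper)
        result.append(x if fitness(x) <= fitness(x_opp) else x_opp)
    return result


def chaotic_local_search(best, fitness, lower, upper,
                         steps=20, radius=0.1, z=0.7):
    """Perturb the best solution with a logistic-map sequence and
    accept a candidate only if it improves the fitness."""
    current = list(best)
    for _ in range(steps):
        candidate = []
        for xi, lo, hi in zip(current, lower, upper):
            z = 4.0 * z * (1.0 - z)  # logistic map, chaotic regime
            step = radius * (hi - lo) * (2.0 * z - 1.0)
            candidate.append(min(hi, max(lo, xi + step)))  # stay in bounds
        if fitness(candidate) < fitness(current):
            current = candidate
    return current


if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    lower, upper = [-5.0, -5.0], [10.0, 10.0]
    population = obl_step([[9.0, 9.0], [1.0, -1.0]], sphere, lower, upper)
    best = min(population, key=sphere)
    print(chaotic_local_search(best, sphere, lower, upper))
```

Because both steps accept a new point only when it is at least as good as the old one, neither can make the best fitness in the population worse.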

49. The Binding Mechanism of Carbapenems in the Class A beta-lactamase IMI-1 : A Molecular Dynamics Study of Ligand Stability

Author: Lindahl, Isabell
Subjects: Fysikalisk kemi, Other Chemical Engineering, Bioinformatics (Computational Biology), antibiotic resistance, biapenem, Beräkningsmatematik, β-lactamase, active site, free energy, Physical Chemistry, betalactamase, Carbapenemase, Computational Mathematics, meropenem, IMI-1, Bioinformatik (beräkningsbiologi), linear interaction energy, LIE, imipenem, Annan kemiteknik
Abstract: Antibiotic resistance is a global and accelerating matter. Over time, the bacteria have evolved several defense mechanisms against the antibiotics. One of the defense mechanisms is that the bacteria can produce enzymes with the ability to hydrolyze the characteristic β-lactam ring of the antibiotics. These enzymes are called β-lactamases. There are three different generations of antibiotics clinically available, and β-lactamases have co-evolved with the antibiotics over the generations. The third generation of antibiotics are called the carbapenems and β-lactamases which hydrolyze carbapenems are called carbapenemases. Carbapenemases are promiscuous, which means that they hydrolyze a variety of antibiotics. The β-lactamase IMI-1 is an imipenem-hydrolyzing enzyme and imipenem is a carbapenem, hence IMI-1 is a carbapenemase. In this project, IMI-1 was investigated in complex with the carbapenems imipenem, meropenem and biapenem using computational methods. More specifically, a homology model of IMI-1 was generated and the carbapenems were docked into the model. The system was then used for MD simulations where the important molecular interactions were identified, and the binding free energies were calculated using the LIE method. The results indicate that IMI-1 has flexible loops that enable an open and a closed conformation of IMI-1. All three carbapenems were docked and simulated in both conformations of IMI-1. The results indicate that the open and closed conformations confirm the promiscuity of carbapenemases since the flexibility enables various initial binding mechanisms. In other words, the hydrolysis may occur so quickly that the binding does not have much bearing on the activity of the enzyme. Furthermore, the calculated binding free energies indicate that IMI-1 is optimized for the catalytic process rather than the binding affinity. 
In conclusion, IMI-1 and similar systems require further research using computational methods to counteract antibiotic resistance based on knowledge.
Published: 2022
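
Note on record 49: the linear interaction energy (LIE) method mentioned in the abstract estimates a binding free energy from the change in average ligand-surroundings interaction energies between the bound and free states, as alpha * Δ⟨V_vdW⟩ + beta * Δ⟨V_el⟩ + gamma. The sketch below illustrates that arithmetic only. The coefficient defaults are commonly quoted values chosen here for illustration, the function name is hypothetical, and the input energies are made up; the values used in the thesis are not given in the record.

```python
def lie_binding_free_energy(vdw_bound, vdw_free, el_bound, el_free,
                            alpha=0.18, beta=0.5, gamma=0.0):
    """LIE estimate of the binding free energy.

    Each argument is a list of ligand-surroundings interaction energies
    sampled from an MD trajectory (bound = ligand in the protein,
    free = ligand in water). alpha, beta and gamma are assumed,
    illustrative coefficients, not values from the thesis.
    """
    def mean(values):
        return sum(values) / len(values)

    delta_vdw = mean(vdw_bound) - mean(vdw_free)  # van der Waals term
    delta_el = mean(el_bound) - mean(el_free)     # electrostatic term
    return alpha * delta_vdw + beta * delta_el + gamma


if __name__ == "__main__":
    # Made-up energies for illustration only (for example in kcal/mol).
    print(lie_binding_free_energy([-30.0, -32.0], [-20.0, -22.0],
                                  [-50.0, -54.0], [-40.0, -44.0]))
```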

50. Iterative full-genome phasing and imputation using neural networks

Author: Rydin, Lotta
Subjects: Bioinformatics (Computational Biology), Machine learning, Genotype data, Bioinformatik (beräkningsbiologi), Quantitative Biology::Populations and Evolution, Convolutional neural networks, Quantitative Biology::Genomics, U-Net
Abstract: In this project, a model based on a convolutional neural network has been developed with the aim of imputing missing genotype data. This model was based on an already existing autoencoder that was modified into a U-Net structure. The network was trained and used iteratively with the intention that the result would improve in each iteration. In order to do this, the output of the model was used as the input in the next iteration. The data used in this project was diploid genotype data, which was phased into haploids and then run separately through the network. In each iteration, the new haploids were generated based on the output haploids. These were used as input in the next iteration. The result showed that the accuracy of the imputation improved slightly in every iteration. However, it did not surpass the same model that was trained for one single iteration. Further work is needed to make the model more useful.
Published: 2022
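
Note on record 50: the iterative scheme in the abstract, where the model output becomes the input of the next iteration, can be sketched as a simple loop. This is an illustration under stated assumptions, not the thesis code: the neural network is replaced by a placeholder callable, missing genotypes are encoded as -1, and observed values are kept fixed while only missing positions are updated. The names `iterative_impute` and `copy_left_neighbour` are hypothetical.

```python
def iterative_impute(haplotype, model, n_iterations=3, missing=-1):
    """Feed the model's output back in as input for several rounds.

    `model` is any callable mapping a haplotype (list of ints) to a
    predicted haplotype of the same length. Observed positions are
    never overwritten (an assumption made for this sketch).
    """
    mask = [g == missing for g in haplotype]
    current = list(haplotype)
    for _ in range(n_iterations):
        predicted = model(current)
        current = [p if is_missing else g
                   for g, p, is_missing in zip(haplotype, predicted, mask)]
    return current


def copy_left_neighbour(haplotype, missing=-1):
    """Toy stand-in for the network: fill a missing position with the
    value to its left, if that value is known."""
    out = list(haplotype)
    for i in range(1, len(haplotype)):
        if haplotype[i] == missing and haplotype[i - 1] != missing:
            out[i] = haplotype[i - 1]
    return out


if __name__ == "__main__":
    print(iterative_impute([1, -1, -1, 0], copy_left_neighbour, n_iterations=2))
```

With the toy model, each pass fills one more position of a gap, which shows why repeated passes can in principle help; whether they do for a trained network is an empirical question, as the abstract notes.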
