Descriptor: "Bioinformatics (Computational Biology)" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bioinformatics (Computational Biology)"' showing total 3,313 results

1. Unlocking protein sequences : Advances in protein structure and ligand-binding site prediction

Author: Shenoy, Aditi
Abstract: The protein sequence determines how it will fold into its unique three-dimensional structure. Once folded, proteins perform their functions by interacting with other proteins or molecules called ligands within the cell. Experimental determination of protein structure and function is tedious. Computational approaches aim to accurately predict the properties of proteins to complement experimental efforts of understanding biochemical mechanisms within the cell. This thesis introduces computational techniques that predict the structure of protein complexes and identify protein residues involved in interactions with common biomolecules, such as metal ions and nucleic acids, based on sequence information. AlphaFold, a method that predicted protein structure using sequence information with almost experimental accuracy, was a critical breakthrough that shaped the field of protein structure prediction. Subsequently, approaches such as FoldDock adapted the AlphaFold pipeline for dimer complexes. Paper I applies the FoldDock protocol to understand toxin-antitoxin systems. These protein complexes are highly evolutionarily conserved, and high-confidence dimer predictions were generated. Paper II applies the FoldDock protocol to study protein-protein interactions in the human proteome. To verify the reliability of machine-learning-based computational methods, they must be tested on independent data different from the data used to train the method. Paper III involves generating and using a homology-reduced independent test set to benchmark the performance of protein complex structure predictors, including the recent AlphaFold release adapted for multi-chain proteins – AlphaFold-Multimer. A confidence score (pDockQ2) was proposed to estimate the quality of the interfaces within multimers. Paper I, Paper II and Paper III are associated with predicting and evaluating protein-protein interactions. 
Representation learning involves finding effective representations of input data to maxi
Published: 2024

2. Beyond GWAS : Novel Methods and Resources for Genetic Epidemiology

Author: Schmitz, Daniel
Abstract: Since the first human genome assembly’s release, our knowledge of the genetic architecture of complex traits and diseases has grown steadily. Genome-wide association studies (GWAS) played a major role but are limited to common traits and single-nucleotide polymorphisms (SNPs). Technologies and resources like next-generation sequencing, Mendelian Randomization (MR), long-read sequencing and improved reference genomes enable the investigation of variants inaccessible to GWAS, such as copy number variations (CNVs), rare variants and variants in previously unresolved regions. In project I, we performed a GWAS of estradiol measurements using data from UK Biobank and quantified estradiol’s effect on bone mineral density (BMD) using MR. Fourteen loci were associated with estradiol levels in males, of which one was also significant in females, and there was an additional female-specific locus. We found a significant effect of estradiol on BMD, confirming previous research on estrogen’s importance for skeletal health. In project II, we used the GWAS results from project I to investigate the effect of endogenous estradiol on breast, endometrial and ovarian cancer using MR. Estradiol was associated with ovarian cancer and nominally associated with estrogen receptor-positive breast cancer, demonstrating the effect of endogenous estrogen on cancer risk. In project III, we quantified the effect of 184,182 CNVs on 438 blood plasma proteins using whole-genome sequencing (WGS) data from a Northern Swedish cohort and validated our findings using long-read sequencing in a subcohort. Fifteen CNVs were associated with 16 proteins, of which four could be validated using long reads and three more represented more complex variation. Our findings show the effects of CNVs on the plasma proteome and highlight the application of different sequencing technologies for CNV detection. In project IV, we evaluated the use of T2T-CHM13 as reference for the SweGen cohort. 
Compared to GRCh38, mapping quality improved and we identifie
Published: 2024

3. Computational Models of Spatial Transcriptomes

Author: Bergenstråhle, Ludvig
Abstract: Spatial biology is a rapidly growing field that has seen tremendous progress over the last decade. We are now able to measure how the morphology, genome, transcriptome, and proteome of a tissue vary across space. Datasets generated by spatial technologies reflect the complexity of the systems they measure: They are multi-modal, high-dimensional, and layer an intricate web of dependencies between biological compartments at different length scales. To add to this complexity, measurements are often sparse and noisy, obfuscating the underlying biological signal and making the data difficult to interpret. In this thesis, we describe how data from spatial biology experiments can be analyzed with methods from deep learning and generative modeling to accelerate biological discovery. The thesis is divided into two parts. The first part provides an introduction to the fields of deep learning and spatial biology, and how the two can be combined to model spatial biology data. The second part consists of four papers describing methods that we have developed for this purpose. Paper I presents a method for inferring spatial gene expression from hematoxylin and eosin stains. The proposed method offers a data-driven approach to analyzing histopathology images without relying on expert annotations and could be a valuable tool for cancer screening and diagnosis in the clinics. Paper II introduces a method for jointly modeling spatial gene expression with histology images. We show that the method can predict super-resolved gene expression and transcriptionally characterize small-scale anatomical structures. Paper III proposes a method for learning flexible Markov kernels to model continuous and discrete data distributions. We demonstrate the method on various image synthesis tasks, including unconditional image generation and inpainting. Paper IV leverages the techniques introduced in Paper III to integrate data from different spatial biology experiments. 
The proposed method can be use
Published: 2024

4. Self-supervised deep learning and EEG categorization

Author: Svantesson, Mats
Abstract: Deep learning has the potential to be used to improve and streamline EEG analysis. At present, classifiers and supervised learning dominate the field. Supervised learning depends on target labels, which most often are created by human experts manually classifying data. A problem with supervised learning is intra- and interrater agreement, which in some instances is far from perfect. This can affect the training and make evaluation more difficult. This thesis includes three papers where self-supervised deep neural networks were developed. In self-supervised learning, the input data to the networks themselves contain structures that are used as targets for the training and no labeling is necessary. In paper I, deep neural networks were trained to increase the number of EEG channels or to recreate missing ones. The performance was at least on the same level as that of spherical interpolation, but unlike in the case of interpolation, missing data does not have to be identified manually first. Papers II and III involved developing deep neural networks for clustering analysis. The networks produced two-dimensional representations of EEG data and the training strategy was based on the principle of t-distributed stochastic neighbor embedding (t-SNE). In paper II, comparisons were made to parametric t-SNE and EEG features obtained from time-frequency methods. The deep neural networks produced more distinct clustering when tested on data annotated for epileptiform discharges, seizure activity, or sleep-wakefulness. In paper III, the newly developed method was used to compare annotations of epileptiform discharges. Two experts performed independent annotations and classifiers were trained on these, using supervised learning, which in turn produced new annotations. The agreement when comparing two sets of annotations was not larger between the two experts than between an expert and a classifier. 
The analysis showed that differences in the annotations by the experts infl
Funding: The work was funded by Saad Nagi (grants RÖ-974228, RÖ-962769, and RÖ-941377), Magnus Thordstein (grant RÖ-986017), and Håkan Olausson (grants LIO-936017 and RÖ-941359).
Published: 2024
Full Text: View/download PDF

5. A computational and statistical framework for cost-effective genotyping combining pooling and imputation

Author: Clouard, Camille
Abstract: The information conveyed by genetic markers, such as single nucleotide polymorphisms (SNPs), has been widely used in biomedical research to study human diseases and is increasingly valued in agriculture for genomic selection purposes. Specific markers can be identified as a genetic signature that correlates with certain characteristics in a living organism, e.g. a susceptibility to disease or high-yield traits. Capturing these signatures with sufficient statistical power often requires large volumes of data, with thousands of samples to be analysed and potentially millions of genetic markers to be screened. Relevant effects are particularly delicate to detect when the genetic variations involved occur at low frequencies. The cost of producing such marker genotype data is therefore a critical part of the analysis. Despite recent technological advances, production costs can still be prohibitive on a large scale and genotype imputation strategies have been developed to address this issue. Genotype imputation methods have been extensively studied in human data and, to a lesser extent, in crop and animal species. A recognised weakness of imputation methods is their lower accuracy in predicting the genotypes for rare variants, whereas those can be highly informative in association studies and improve the accuracy of genomic selection. In this respect, pooling strategies can be well suited to complement imputation, as pooling is efficient at capturing the low-frequency items in a population. Pooling also reduces the number of genotyping tests required, making its use in combination with imputation a cost-effective compromise between accurate but expensive high-density genotyping of each sample individually and stand-alone imputation. However, due to the nature of genotype data and the limitations of genotype testing techniques, decoding pooled genotypes into unique data resolutions is challenging. 
In this work, we study the characteristics of decoded genotype data from po
Published: 2024

6. Mathematical Modelling of Cerebral Metabolism : From Ion Channels to Metabolic Fluxes

Author: Sundqvist, Nicolas
Abstract: The brain is the most metabolically active organ in the human body and therefore relies on a continuous supply of oxygen and glucose. Neuronal stimulation in specific regions of the brain leads to the firing of action potentials, a process facilitated by voltage-gated ion channels in the neurons’ cell membranes. This activation of the ion channels significantly elevates the brain’s metabolic energy demand, compelling neurons to ramp up their metabolic activity in response. Concurrently, this neuronal activation also initiates a signalling cascade that induces vasodilation and increases blood flow, thereby ensuring that regions with elevated neural activity are adequately supplied with oxygen and nutrients. This dynamic interplay between neuronal activity and cerebral blood flow (CBF) regulation constitutes the neurovascular coupling (NVC). The NVC is a cornerstone in interpreting functional Magnetic Resonance Imaging (fMRI) Blood Oxygen Level-Dependent (BOLD) responses. The BOLD response is an indirect, non-invasive, and highly sensitive indicator of neuronal activity, reflecting changes in blood oxygenation and flow associated with the neuronal and metabolic activity in the brain. By examining these responses, we can gain insights into the complex interactions between neuronal activity, energy metabolism, and CBF. Additionally, techniques such as 13C Metabolic Flux Analysis (13C MFA) make it possible to gain further insight into cerebral metabolism. This method enables a detailed examination of metabolic pathways and fluxes by tracking the incorporation of 13C-labelled substrates into various metabolites. By using 13C MFA, researchers can quantify the flow of substrates through metabolic networks, offering a deeper understanding of how cells such as neurons adapt their metabolism during different functional states and conditions. Central to exploring these multifaceted aspects of cerebral metabolism is the use of mathematical modelling and systems biology. 
These discip
Published: 2024
Full Text: View/download PDF

7. Identifying Graph Characteristics in Growing Vascular Networks

Author: Plummer, Christopher Finn
Abstract: One of the ways that a vascular network grows is through the process of angiogenesis, whereby a new blood vessel forms as a branch from an existing vessel towards an area which is stimulating vascular growth. Due to the demands for nutrients and waste transport, growing tumour cells will access the surrounding vascular network by inducing angiogenesis. Once the tumour is connected with the vascular system it can grow further and colonize distant organs. Given the critical nature of this step in tumour development, there is a demand for mathematical and computational models to provide an understanding of the process for treatment in predictive medicine. These models allow us to generate vascular networks that demonstrate similar behaviour to that of the observed networks; however, there is a lack of quantifiable measures of similarity between generated networks, or, of a generated and real network. Furthermore, there is not an established way to determine which measures hold the most relevance to distinguishing similarity. To construct such a measure we transform our generated vascular networks into an abstract graph representation which allows exploration of the plethora of graph centralities. We propose to determine the relevance of a centrality by finding one that acts as a synthetic likelihood function for estimating the model's parameters with minimal error. Evaluating the relevance of many centralities, it is then possible to suggest which centralities should be used to quantitatively determine similarity. This allows for a way to measure how realistic a model's growth is, and if given sufficient data, to distinguish between regular and tumour-induced angiogenesis and use it within cancer screening.
Published: 2024
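
Illustrative note on record 7: the abstract describes turning a vascular network into an abstract graph and computing centralities on it. The sketch below shows one such centrality (degree centrality) on a toy network. The segment list, the node names and the helper functions are hypothetical and are not taken from the thesis, which may use different centralities and a different graph construction.

```python
from collections import defaultdict


def build_graph(segments):
    """Build an undirected adjacency map from (node, node) vessel segments."""
    adjacency = defaultdict(set)
    for a, b in segments:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbours) / (n - 1)
            for node, neighbours in adjacency.items()}


# Hypothetical toy network: one parent vessel with two branch points.
segments = [("root", "a"), ("a", "b"), ("a", "c"), ("c", "d")]
centrality = degree_centrality(build_graph(segments))
for node in sorted(centrality):
    print(node, centrality[node])
```

Branch points score higher than vessel tips here, so a summary of these values (for example their distribution) is one candidate statistic for comparing a generated network with an observed one.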

8. Genotyping of SNPs in bread wheat at reduced cost from pooled experiments and imputation

Author: Clouard, Camille and Nettelblad, Carl
Abstract: The plant breeding industry has shown growing interest in using the genotype data of relevant markers for performing selection of new competitive varieties. The selection usually benefits from large amounts of marker data and it is therefore crucial to have access to data collection methods that are both cost-effective and reliable. Computational methods such as genotype imputation have been proposed earlier in several plant science studies for addressing the cost challenge. Genotype imputation methods have, however, been used more frequently and investigated more extensively in human genetics research. The various algorithms that exist have shown lower accuracy at inferring the genotype of genetic variants occurring at low frequency, while these rare variants can have great significance and impact in the genetic studies that underlie selection. In contrast, pooling is a technique that can efficiently identify low-frequency items in a population and it has been successfully used for detecting the samples that carry rare variants in a population. In this study, we propose to combine pooling and imputation, and demonstrate this by simulating a hypothetical microarray for genotyping a population of recombinant inbred lines in a cost-effective and accurate manner, even for rare variants. We show that with an adequate imputation model, it is feasible to accurately predict the individual genotypes at lower cost than sample-wise genotyping and time-effectively. Moreover, we provide code resources for reproducing the results presented in this study in the form of a containerized workflow.
Note: eSSENCE - An eScience Collaboration
Published: 2024
Full Text: View/download PDF

9. Metagenomic analysis of Mesolithic chewed pitch reveals poor oral health among stone age individuals

Author: Kirdök, Emrah, Kashuba, Natalija, Damlien, Hege, Manninen, Mikael A., Nordqvist, Bengt, Kjellström, Anna, Jakobsson, Mattias, Lindberg, A. Michael, Storå, Jan, Persson, Per, Andersson, Björn, Aravena, Andrés, and Götherström, Anders
Abstract: Prehistoric chewed pitch has proven to be a useful source of ancient DNA, both from humans and their microbiomes. Here we present the metagenomic analysis of three pieces of chewed pitch from Huseby Klev, Sweden, that were dated to 9,890-9,540 before present. The metagenomic profile exposes a Mesolithic oral microbiome that includes opportunistic oral pathogens. We compared the data with healthy and dysbiotic microbiome datasets and we identified increased abundance of periodontitis-associated microbes. In addition, trained machine learning models predicted dysbiosis with 70-80% probability. Moreover, we identified DNA sequences from eukaryotic species such as red fox, hazelnut, red deer and apple. Our results indicate a case of poor oral health during the Scandinavian Mesolithic, and show that pitch pieces have the potential to provide information on material use, diet and oral health.
Published: 2024
Full Text: View/download PDF

10. Machine learning approaches to enhance diagnosis and staging of patients with MASLD using routinely available clinical information

Author: Mcteer, Matthew, Applegate, Douglas, Mesenbrink, Peter, Ratziu, Vlad, Schattenberg, Joern M., Bugianesi, Elisabetta, Geier, Andreas, Gomez, Manuel Romero, Dufour, Jean-Francois, Ekstedt, Mattias, Francque, Sven, Yki-Jarvinen, Hannele, Allison, Michael, Valenti, Luca, Miele, Luca, Pavlides, Michael, Cobbold, Jeremy, Papatheodoridis, Georgios, Holleboom, Adriaan G., Tiniakos, Dina, Brass, Clifford, Anstee, Quentin M., and Missier, Paolo
Abstract: Aims: Metabolic dysfunction Associated Steatotic Liver Disease (MASLD) outcomes such as MASH (metabolic dysfunction associated steatohepatitis), fibrosis and cirrhosis are ordinarily determined by resource-intensive and invasive biopsies. We aim to show that routine clinical tests offer sufficient information to predict these endpoints. Methods: Using the LITMUS Metacohort derived from the European NAFLD Registry, the largest MASLD dataset in Europe, we create three combinations of features which vary in degree of procurement, including a 19-variable feature set that is attained through a routine clinical appointment or blood test. This data was used to train predictive models using the supervised machine learning (ML) algorithm XGBoost, alongside the missing-data imputation technique MICE and the class balancing algorithm SMOTE. Shapley Additive exPlanations (SHAP) were added to determine relative importance for each clinical variable. Results: Analysing nine biopsy-derived MASLD outcomes of cohort size ranging between 5385 and 6673 subjects, we were able to predict individuals at training set AUCs ranging from 0.719-0.994, including classifying individuals who are At-Risk MASH at an AUC = 0.899. Using two further feature combinations of 26 variables and 35 variables, which included composite scores known to be good indicators for MASLD endpoints and advanced specialist tests, we found predictive performance did not sufficiently improve. We are also able to present local and global explanations for each ML model, offering clinicians interpretability without worsening predictive performance. Conclusions: This study developed a series of ML models of accuracy ranging from 71.9-99.4% using only easily extractable and readily available information in predicting MASLD outcomes which are usually determined through highly invasive means.
Funding: Newcastle University; Red Hat UK; LITMUS project - Innovative Medicines Initiative 2 Joint Undertaking [777377]; European Union's Horizon 2020 research and innovation programme; EFPIA; Newcastle NIHR Biomedical Research Centre.
Published: 2024
Full Text: View/download PDF

11. Reproducible mass spectrometry data processing and compound annotation in MZmine 3

Author: Heuckeroth, Steffen, Damiani, Tito, Smirnov, Aleksandr, Mokshyna, Olena, Brungs, Corinna, Korf, Ansgar, Smith, Joshua David, Stincone, Paolo, Dreolin, Nicola, Nothias, Louis-Félix, Hyötyläinen, Tuulia, Oresic, Matej, Karst, Uwe, Dorrestein, Pieter C., Petras, Daniel, Du, Xiuxia, van der Hooft, Justin J. J., Schmid, Robin, and Pluskal, Tomáš
Abstract: Untargeted mass spectrometry (MS) experiments produce complex, multidimensional data that are practically impossible to investigate manually. For this reason, computational pipelines are needed to extract relevant information from raw spectral data and convert it into a more comprehensible format. Depending on the sample type and/or goal of the study, a variety of MS platforms can be used for such analysis. MZmine is an open-source software for the processing of raw spectral data generated by different MS platforms. Examples include liquid chromatography-MS, gas chromatography-MS and MS-imaging. These data might typically be associated with various applications including metabolomics and lipidomics. Moreover, the third version of the software, described herein, supports the processing of ion mobility spectrometry (IMS) data. The present protocol provides three distinct procedures to perform feature detection and annotation of untargeted MS data produced by different instrumental setups: liquid chromatography-(IMS-)MS, gas chromatography-MS and (IMS-)MS imaging. For training purposes, example datasets are provided together with configuration batch files (i.e., list of processing steps and parameters) to allow new users to easily replicate the described workflows. Depending on the number of data files and available computing resources, we anticipate this to take between 2 and 24 h for new MZmine users and nonexperts. Within each procedure, we provide a detailed description for all processing parameters together with instructions/recommendations for their optimization. The main generated outputs are represented by aligned feature tables and fragmentation spectra lists that can be used by other third-party tools for further downstream analysis.
Note: Study Protocol. T.P. is supported by the Czech Science Foundation (GA CR) grant 21-11563M and by the European Union's Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement no. 891397. T.D. is supported by the European Regional Development Fund, Programme Johannes Amos Comenius project 'IOCB MSCA PF Mobility' no. CZ.02.01.01/00/22_010/0002733. C.B. is supported by the Czech Academy of Sciences Program to Support Prospective Human Resources. A.S. and X.D. are supported by the National Institutes of Health grant U01CA235507. P.C.D. is supported by R01GM107550, R03OD034493, R01DK136117 and NSF 2152526.
Published: 2024
Full Text: View/download PDF

12. The Evolutionary History of Picozoa : Phylogenomic inquiries into the plastid-lacking Archaeplastids

Author: Wanntorp, Matias
Published: 2024

13. Using ADME/PK models to improve generative molecular design with reinforcement learning

Author: Pop, Cristian-Catalin
Abstract: An adequate ADME/PK (absorption, distribution, metabolism, excretion, pharmacokinetics) profile is an essential quality for a drug. As part of the drug discovery process, leads are iteratively designed and optimized in order to simultaneously satisfy various properties such as appropriate ADME/PK levels and high biological activity for a target. The drug discovery process can be accelerated by improving the likelihood that a designed compound fulfils the necessary pharmacologic properties, thus reducing the number of needed iterations. A promising technique is de novo drug design, where molecules are computationally generated based on a set of desired attributes. Our project aimed to benchmark the effectiveness of the ANDROMEDA ADME/PK conformal prediction models in guiding the generation of compounds toward an area of chemical space with good ADME/PK properties. For this, we used the REINVENT reinforcement learning framework built by the Molecular AI team at AstraZeneca. Here, we integrated 4 out of the 14 available ANDROMEDA models (fabs, fdiss, CLint and Vss) as oracles in the scoring component of the generative model. Oral bioavailability (F) is a secondary parameter that was computed with the help of the aforementioned models and fu (unbound fraction in plasma), and serves as the fifth ADME/PK oracle in our analysis. We aimed to rediscover DRD2 bioactives with a good ADME/PK profile. Our results show that the ANDROMEDA models have a slight influence on the predicted ADME/PK properties of the generated compounds. The results do not show an increased likelihood of generating DRD2 ligands in the case of the primary ANDROMEDA models. However, when using the oral bioavailability oracle, the sampling likelihood increases for some of the approved DRD2 ligands. In conclusion, the oral bioavailability ANDROMEDA model can be a promising option for guiding the generation of novel compounds towards an area of chemical space with good ADME/PK properties.
Published: 2024

14. Investigating the impact of dose banding and oral formulations of paracetamol in pediatrics: A pharmacokinetic simulation-based safety assessment study

Author: Rosenqvist, Julia
Abstract: Paracetamol, a widely used analgesic and antipyretic drug, can be found in various formulations and doses for both home and hospital use. The aim of this study was to investigate the impact of off-label dosing of paracetamol in pediatric clinical practice using physiologically based pharmacokinetic (PBPK) modeling. The model was initially developed for adults by integrating relevant in vitro, in vivo and in silico data of paracetamol, after which the model was extrapolated for pediatrics by adding ontogeny information. The model was successfully validated in both adult and pediatric populations, and it showed accuracy for both oral and intravenous administration routes. After validation, simulations were conducted across nine different age groups following the recommended doses in Sweden. These simulations showed that tablet dosing is comparable to solution dosing, resulting in nearly identical maximum concentrations and area under the curve (AUC) values. Furthermore, it was observed that gastric emptying time, which reflects the fed state of individuals, significantly influences the maximum concentration, with longer gastric emptying times resulting in lower and delayed peak concentrations. However, the gastric emptying time had no effect on the AUC values. Lastly, the model’s performance on overdose data was evaluated, and the model performed better when the activity of drug-metabolizing enzymes was left unchanged or only slightly elevated. In conclusion, the developed PBPK model can be further used as a safe and effective way of exploring dose banding and designing clinical trials in pediatrics.
Published: 2024
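
The reported pattern (slower gastric emptying lowers and delays the peak but leaves the AUC unchanged) can be illustrated with a model far simpler than the PBPK model described in the abstract. The sketch below assumes a one-compartment model with first-order absorption and elimination, where the absorption rate constant `ka` stands in for gastric emptying. All function names and parameter values are hypothetical and are not taken from the thesis.

```python
import math

def concentration(t, dose, f, ka, ke, v):
    """Plasma concentration at time t for a one-compartment model with
    first-order absorption (ka) and elimination (ke); requires ka != ke."""
    return f * dose * ka / (v * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def t_max(ka, ke):
    """Time of peak concentration for the same model."""
    return math.log(ka / ke) / (ka - ke)

def auc(dose, f, ke, v):
    """Area under the curve from 0 to infinity; note that ka does not appear."""
    return f * dose / (v * ke)

# Hypothetical parameters: the same dose with fast vs. slow absorption.
dose, f, ke, v = 500.0, 0.9, 0.3, 50.0
for ka in (2.0, 0.5):
    tm = t_max(ka, ke)
    print(f"ka={ka}: t_max={tm:.2f} h, "
          f"C_max={concentration(tm, dose, f, ka, ke, v):.2f}, "
          f"AUC={auc(dose, f, ke, v):.2f}")
```

In this simplified setting the AUC equals F·dose/(V·ke), so it does not depend on the absorption rate, while the peak becomes lower and later as `ka` decreases.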

15. Protein-drug binding affinity prediction with machine learning : Assessing the impact of features from molecular dynamic simulations

Author: Guttormsson, Guðmundur Andri and Le Gallo, Léa
Abstract: Drug development is generally a long and costly process, and one major factor is estimating the affinity of protein-drug binding. Leveraging machine learning in this field is a promising approach as it can streamline the prediction process and reduce the need for expensive experimental methods. Machine learning methods have already enabled significant advances in predicting protein-drug binding affinity, yet there remains room for improvement. The primary challenge is the quality of the data used for these machine learning models. In this work, two ensemble machine learning models, Random Forest and Extreme Gradient Boosting Machine, have been tested and compared using a recent database of protein-ligand complex features calculated from molecular dynamics simulations. Additional features were also extracted from the PDB database through PLIP (Protein-Ligand Interaction Profiler), aiming to improve the predictions further. The results indicate that while the features from the PDB database provided strong predictive power, including features from molecular dynamics simulations did not improve the models’ performance.
Published: 2024

16. Exploring the performance of Conformal Prediction on Chemical Properties and Its Influencing Factors

Author: Chen, Yuhang
Abstract: Machine learning has gained much attention and has been extended to the field of drug discovery. However, due to the uncertainties in the dataset, predictions should be analyzed quantitatively. Conformal prediction is a powerful method for quantifying these uncertainties: for a predefined confidence level, it produces a corresponding interval within which the true target is anticipated to fall. This paper aims to explore how different chemical representations of SMILES structures for training (chemical descriptors, Morgan fingerprints), machine learning algorithms (k-nearest neighbor, support vector machine, random forest, extreme gradient boosting, and artificial neural network), and different normalization methods (k-nearest neighbor, Mondrian regression) influence the conformal prediction results. We find that Morgan fingerprints outperform chemical descriptors, and that Mondrian regression outperforms k-nearest neighbor for one or several values of coverage and for the mean, median, and standard deviation of the output interval. None of the investigated machine learning methods markedly outperforms the others. The conformal predictive system, an alternative form of conformal prediction, was also investigated to explore its usefulness in drug discovery.
Published: 2024
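
As a minimal sketch of how conformal prediction turns a confidence level into an interval, the following implements plain split (inductive) conformal regression with absolute-residual scores. It is not the normalized k-nearest-neighbor or Mondrian variant studied in the thesis, and the function names, the toy predictor and the calibration data are all hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this confidence level.
        return math.inf
    return sorted(scores)[k - 1]

def split_conformal_interval(predict, calibration, x_new, alpha=0.1):
    """Prediction interval for x_new at confidence level 1 - alpha.

    predict     : any already-fitted point predictor, called as predict(x)
    calibration : list of (x, y) pairs that were not used to fit predict
    """
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, alpha)
    centre = predict(x_new)
    return centre - q, centre + q

# Toy usage with a hypothetical predictor y = 2x and made-up calibration data.
predict = lambda x: 2 * x
residuals = [-1, 2, -3, 4, -5, 6, -7, 8, -9]
calibration = [(i, 2 * i + r) for i, r in enumerate(residuals)]
print(split_conformal_interval(predict, calibration, x_new=10, alpha=0.5))
```

The interval width here is the same for every input; normalization methods such as those compared in the abstract aim to scale it per example.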

17. The adaptive potential of effectively small and shrinking populations

Author: Eriksson, Leonora
Abstract: It is well known that genetic variation and the ability to adapt are crucial for the survival of any population. Whether it concerns a natural population’s ability to respond to changes in its environment or a livestock population’s ability to produce more milk, genetic variation is a key element. Effectively small populations have an increased risk of extinction caused by a reduced ability to adapt or respond to selection. Small populations are also more affected by genetic drift, which can cause deleterious mutations to become fixed, reducing the population’s fitness, possibly to the point where it is unable to survive. Models describing changes in allele frequencies in a population under selection can be used to study a population’s response to selection. A limitation of such models is that they often assume infinite population size and neglect the effects of genetic drift, making them unsuitable for effectively small populations. Here, an individual-based model of a quantitative trait affected by selection, mutation and genetic drift is used to study the adaptive potential of effectively small populations. In a series of simulations, changes in the trait are explored under directional selection and under stabilizing selection with adaptation to one, and to several repeated, shifts in optimum. The simulation results indicate that populations under strong directional selection, such as breeding, potentially risk losing all adaptive potential. The results also suggest that the effects of strong directional selection might be irreversible, even if the strong selective pressure is removed.
Published: 2024
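
The role of genetic drift in small populations can be illustrated with a neutral Wright-Fisher simulation of a single biallelic locus. This is a deliberately reduced sketch, not the individual-based quantitative-trait model with selection and mutation used in the thesis; the function name, parameters and seed are hypothetical.

```python
import random

def wright_fisher(n_individuals, p0, generations, seed=0):
    """Allele-frequency trajectory under pure drift in a diploid population.

    Each generation, 2 * n_individuals gene copies are drawn at random
    from the previous generation's allele frequency (binomial sampling).
    """
    rng = random.Random(seed)
    copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        count = sum(1 for _copy in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
    return trajectory

# Smaller populations tend to drift to loss (0.0) or fixation (1.0) sooner.
for n in (10, 1000):
    final = wright_fisher(n, p0=0.5, generations=200)[-1]
    print(f"N={n}: allele frequency after 200 generations = {final:.3f}")
```

Once the frequency reaches 0 or 1 it stays there, which is why variation lost to drift cannot be recovered without mutation or migration.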

18. Data Deconvolution for Drug Prediction

Author: Menacher, Lisa Maria
Abstract: Treating cancer is difficult as the disease is complex and drug responses often depend on the patient's characteristics. Precision medicine aims to solve this by selecting individualized treatments. Since this involves the analysis of large datasets, machine learning can be used to make the drug selection process more efficient. Traditionally, such models utilize bulk gene expression data. However, this potentially masks information from small cell populations and fails to address tumor heterogeneity. Therefore, this thesis applies data deconvolution methods to bulk gene expression data and estimates the corresponding cell type-specific gene expression profiles. This "increases" the resolution of the input data for the drug response prediction. A hold-out dataset, LODOCV and LOCOCV were used to evaluate this approach. Furthermore, all results are compared against a baseline model, which was trained on bulk data. Overall, the accuracy of the cell type-specific model did not improve on that of the bulk model. The cell type-specific model also prioritizes information from bulk samples, which makes the additional data unnecessary. The robustness of the cell type-specific model is slightly lower than that of the bulk model. Note that these outcomes are not necessarily due to a flaw in the underlying concept, but may be connected to poor deconvolution results, as the same reference matrix was used for the deconvolution of all bulk samples regardless of the cancer type or disease.
Published: 2024

19. Enhancing carbon fixation in Rubisco through generative modelling

Author: Shute, Ellen
Abstract: Carbon capture, the removal of carbon dioxide (CO2) from the atmosphere, has gained attention as a method to mitigate the effects of global warming. Plants and phototrophic microorganisms have the inherent ability to capture carbon through the fixation of CO2 to produce biomass. However, native carbon fixing pathways are limited by key enzymes with low catalytic activity, resulting in low energy efficiency. Rubisco is one such key enzyme, notorious for its poor performance. Past research has been unsuccessful at enhancing carbon fixation in Rubisco through conventional methods. Generative modelling has emerged as an innovative approach to enzyme engineering, taking advantage of different neural network architectures to propose novel variants with desired characteristics. Here, a variational autoencoder (VAE) trained on the Rubisco sequence space was applied to the challenge of Rubisco engineering. Two models were trained and, using the dimensionality reduction property of VAEs, the fitness landscape of Rubisco was explored. Sequences were labelled with catalytically relevant data and a regression model was built with the aim of predicting those sequences with enhanced catalytic activity. Novel Rubisco sequences were generated following systematic interrogation of the low-dimensional space. The use of generative modelling here provides a fresh perspective on Rubisco engineering.
Published: 2024

20. Do the receptive fields in the primary visual cortex span a variability over the degree of elongation of the receptive fields?

Author: Lindeberg, Tony
Abstract: This paper presents the results of combining (i) theoretical analysis regarding connections between the orientation selectivity and the elongation of receptive fields for the affine Gaussian derivative model with (ii) biological measurements of orientation selectivity in the primary visual cortex to investigate if (iii) the receptive fields can be regarded as spanning a variability in the degree of elongation. From an in-depth theoretical analysis of idealized models for the receptive fields of simple and complex cells in the primary visual cortex, we established that the orientation selectivity becomes more narrow with increasing elongation of the receptive fields. Combined with previously established biological results, concerning broad vs. sharp orientation tuning of visual neurons in the primary visual cortex, as well as previous experimental results concerning distributions of the resultant of the orientation selectivity curves for simple and complex cells, we show that these results are consistent with the receptive fields spanning a variability over the degree of elongation of the receptive fields. We also show that our principled theoretical model for visual receptive fields leads to qualitatively similar types of deviations from a uniform histogram of the resultant descriptor of the orientation selectivity curves for simple cells, as can be observed in the results from biological experiments. To firmly determine if the underlying working hypothesis, regarding the receptive fields spanning a variability in the degree of elongation, would truly hold for the receptive fields in the primary visual cortex of higher mammals, we formulate a set of testable predictions that can be used to investigate this property experimentally and, if applicable, to characterize whether such a variability would, in a structured way, be related to the pinwheel structure in the visual cortex., QC 20240410
Published: 2024

21. Inequality relations for NMR-based polymer homoblock analysis and extended application: Reanalysis of historical data on alginates, chitosans, homogalacturonans, and galactomannans

Author: Xing, Xiaohui, Xing, Kanglin, Hsieh, Yves S. Y., and Abbott, D. Wade
Abstract: QC 20240618
Published: 2024
Full Text: View/download PDF

22. Orientation selectivity properties for the affine Gaussian derivative and the affine Gabor models for visual receptive fields

Author: Lindeberg, Tony
Abstract: This paper presents an in-depth theoretical analysis of the orientation selectivity properties of simple cells and complex cells, that can be well modelled by the generalized Gaussian derivative model for visual receptive fields, with the purely spatial component of the receptive fields determined by oriented affine Gaussian derivatives for different orders of spatial differentiation. A detailed mathematical analysis is presented for the three different cases of either: (i) purely spatial receptive fields, (ii) space-time separable spatio-temporal receptive fields and (iii) velocity-adapted spatio-temporal receptive fields. Closed-form theoretical expressions for the orientation selectivity curves for idealized models of simple and complex cells are derived for all these main cases, and it is shown that the orientation selectivity of the receptive fields becomes more narrow, as a scale parameter ratio $\kappa$, defined as the ratio between the scale parameters in the directions perpendicular to vs. parallel with the preferred orientation of the receptive field, increases. It is also shown that the orientation selectivity becomes more narrow with increasing order of spatial differentiation in the underlying affine Gaussian derivative operators over the spatial domain. Additionally, we derive closed-form expressions for the resultant and the bandwidth descriptors of the orientation selectivity curves, which have previously been used as compact descriptors of the orientation selectivity properties for biological neurons. These results together show that the properties of the affine Gaussian derivative model for visual receptive fields can be analyzed in closed form, which can be highly useful when relating the results from biological experiments to computational models of the functional properties of simple cells and complex cells in the primary visual cortex. 
For comparison, we also present a corresponding theoretical orientation selectivity analysis for purely, QC 20240410
Published: 2024

23. Joint covariance properties under geometric image transformations for spatio-temporal receptive fields according to the generalized Gaussian derivative model for visual receptive fields

Author: Lindeberg, Tony
Abstract: The influence of natural image transformations on receptive field responses is crucial for modelling visual operations in computer vision and biological vision. In this regard, covariance properties with respect to geometric image transformations in the earliest layers of the visual hierarchy are essential for expressing robust image operations, and for formulating invariant visual operations at higher levels. This paper defines and proves a set of joint covariance properties under compositions of spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations, which make it possible to characterize how different types of image transformations interact with each other and the associated spatio-temporal receptive field responses. In this regard, we also extend the notion of scale-normalized derivatives to affine-normalized derivatives, to be able to obtain true affine-covariant properties of spatial derivatives, that are computed based on spatial smoothing with affine Gaussian kernels. The derived relations show how the parameters of the receptive fields need to be transformed, in order to match the output from spatio-temporal receptive fields under composed spatio-temporal image transformations. As a side effect, the presented proof for the joint covariance property over the integrated combination of the different geometric image transformations also provides specific proofs for the individual transformation properties, which have not previously been fully reported in the literature. 
We conclude with a geometric analysis, showing how the derived joint covariance properties make it possible to relate or match spatio-temporal receptive field responses, when observing, possibly moving, local surface patches from different views, under locally linearized perspective or projective transformations, as well as when observing different instances of spatio-temporal events that may occur either faster or slower be, QC 20240429, Covariant and invariant deep networks
Published: 2024

24. Nanometa Live : a user-friendly application for real-time metagenomic data analysis and pathogen identification

Author: Sandås, Kristofer, Lewerentz, Jacob, Karlsson, Edvin, Karlsson, Linda, Sundell, David, Simonyté-Sjödin, Kotryna, and Sjodin, Andreas
Abstract: Summary: Nanometa Live presents a user-friendly interface designed for real-time metagenomic data analysis and pathogen identification utilizing Oxford Nanopore Technologies’ MinION and Flongle flow cells. It offers an efficient workflow and graphical interface for the visualization and interpretation of metagenomic data as it is being generated. Key features include automated BLAST validation, streamlined handling of custom Kraken2 databases, and a simplified graphical user interface for enhanced user experience. Nanometa Live is particularly notable for its capability to run without constant internet or server access once installed, setting it apart from similar tools. It provides a comprehensive view of taxonomic composition and facilitates the detection of user-defined pathogens or other species of interest, catering to both researchers and clinicians. Availability and implementation: Nanometa Live has been implemented as a local web application using the Dash framework with Snakemake handling the data processing. The source code is freely accessible on the GitHub repository at https://github.com/FOIBioinformatics/nanometa_live and it is easily installable using Bioconda. It includes containerization support via Docker and Singularity, ensuring ease of use, reproducibility, and portability.
Published: 2024
Full Text: View/download PDF

25. Bioinformatics for microbiome analysis

Author: Delgado, Luis Fernando
Abstract: Marine ecosystems harbour a vast microbial diversity which plays a crucial role in ecosystem functioning. Advancements in DNA sequencing technologies have transformed our ability to analyse microbial populations comprehensively. Metagenomic sequencing has emerged as a pivotal tool for characterising microbial communities across various environments. Bioinformatics, an interdisciplinary field, facilitates the analysis and interpretation of large biological datasets, including microbiome data. This thesis aims to enhance bioinformatics approaches for analysing marine microbiomes. It comprises four papers covering bioinformatic developments and genomic data analysis across multiple topics, including metagenomics, pangenomics, comparative genomics and population genomics: Paper I evaluated three assembly strategies for constructing gene catalogues from metagenomic samples: individual sample assembly with gene clustering, co-assembly of all samples, and a new hybrid approach, mix assembly. The efficacy of the mix-assembly approach was highlighted for maximising information extraction from metagenomic samples, offering opportunities for further exploration in microbial ecology and environmental genomics. Using the mix-assembly approach, we conducted a comprehensive analysis of 124 metagenomic samples sourced from the Baltic Sea, resulting in the refinement of the Baltic Sea Gene Set (BAGS v1.1), which now encompasses 66.53 million genes annotated for both functionality and taxonomy. In Paper II, we introduced an open-access initiative that provided the mix-assembly pipeline code. We also developed the BAGS-Shiny web application to facilitate user interaction with this extensive gene catalogue. Paper III focused on whole-genome sequencing and assembly of 82 environmental V. vulnificus strains from the Baltic Sea, enabling comprehensive comparative genomic analysis. I developed the PhyloBOTL pipeline, which uses a phylogeny-based approach to identify genes associated with pat, QC 2024-05-15
Published: 2024

26. Algorithms and Models in Nanopore DNA Sequencing : Advanced Decoding and Modeling with Hierarchical Hidden Markov Models

Author: Xu, Xuechun
Abstract: Within less than four decades, nanopore sequencing technology has accelerated from an implausible idea scribbled on a notebook page to one of the decisive contributors to the complete sequence of the human genome. Its rapid evolution, particularly in recent years, is driven not only by the inherent innovation of nanopores but also by synergistic advancements in complementary fields, such as GPU acceleration and deep neural networks, as well as cross-disciplinary influences from domains like speech recognition. However, during this rapid advancement, certain methods within nanopore sequencing have remained relatively unexplored. This oversight has the potential to create bottlenecks in the technology's further development. In this thesis work, we delve into these uncharted areas, seeking to fill critical gaps and branch the technology into new frontiers. Our objective is to unleash its potential, enabling further breakthroughs in genomic research and beyond. Through our research, we have developed two novel algorithms and two innovative models tailored to address these under-explored aspects of nanopore sequencing. The two algorithms, the GMBS and the LFBS, both instances of the more general MBS (marginalised beam search) framework, offer innovative solutions to the challenging decoding problems inherent in HHMMs. They are two distinct variations tailored to different scenarios. While the GMBS is specifically suited for decoding lengthy sequences, such as those encountered in long-read basecalling, the LFBS is optimized for parallel programming and excels in processing short-length sequences. The two innovative models developed in this research, each leveraging variations of HHMMs and employing an end-to-end approach, exhibit distinct structures. The first model, a hybrid of EDHMM and DNN, showcases the effectiveness of integrating both knowledge-driven and data-driven techniques. In contrast, the second model, a custom-designed Helicase HMM, draws inspiration from pioneering studies on motor proteins found, QC 20240502 Hybrid attendance: https://kth-se.zoom.us/j/66248893067
Published: 2024

27. Decoding Genetic Enigmas in Sarcoma

Author: Difilippo, Valeria
Abstract: Sarcomas represent a broad and heterogeneous group of rare tumors. For some subtypes, there are pathognomonic genetic alterations available, while for others such alterations remain to be identified. Especially in entities that harbor large numbers of complex genetic changes, much still remains to be understood. One such entity is osteosarcoma, the most common primary bone tumor. Although primarily affecting children and adolescents, this tumor typically presents a chaotic genome. In Papers I-III, we present different genetic mutational mechanisms that distinguish osteosarcoma sub-entities with different biology and tumor behavior. Namely, we present a recurrent mechanism involving the promoter region of the TP53 tumor suppressor gene in a subset of conventional osteosarcomas. We demonstrate that structural variants abrogate TP53 expression but also relocate its promoter region. By responding to ongoing DNA damage, it in turn leads to upregulation of known or putative oncogenes erroneously translocated into its vicinity. Additionally, we subdivide 12q-amplified osteosarcomas into four distinct groups and show that recurrent promoter swapping events involving the FRS2 and PLEKHA5 regulatory regions occur in many high-grade and dedifferentiated osteosarcomas with CDK4 and MDM2 amplification. Moreover, we found that osteosarcomas with relatively few chromosomal alterations or adult onset are genetically heterogeneous. Finally, in the last part of the thesis (Papers III-IV), we introduced new bioinformatics tools: (i) NAFuse to detect gene fusions; (ii) the genomic complexity score (GCS) to analyze the complexity genome-wide; and (iii) SarcDBase, a tool that integrates genomic and transcriptomic data with existing information. Collectively, this thesis has advanced our understanding of the role played by specific mutations in the development and progression of osteosarcoma and has introduced new bioinformatics tools that facilitate the analysis and interpretation of highly complex genetic in
Published: 2024

28. Machine learning predicts system-wide metabolic flux control in cyanobacteria

Author: Kugler, Amit and Stensjö, Karin
Abstract: Metabolic fluxes and their control mechanisms are fundamental in cellular metabolism, offering insights for the study of biological systems and biotechnological applications. However, quantitative and predictive understanding of controlling biochemical reactions in microbial cell factories, especially at the system level, is limited. In this work, we present ARCTICA, a computational framework that integrates constraint-based modelling with machine learning tools to address this challenge. Using the model cyanobacterium Synechocystis sp. PCC 6803 as chassis, we demonstrate that ARCTICA effectively simulates global-scale metabolic flux control. Key findings are that (i) the photosynthetic bioproduction is mainly governed by enzymes within the Calvin-Benson-Bassham (CBB) cycle, rather than by those involved in the biosynthesis of the end-product, (ii) the catalytic capacity of the CBB cycle limits the photosynthetic activity and downstream pathways and (iii) ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) is a major, but not the most, limiting step within the CBB cycle. Predicted metabolic reactions qualitatively align with prior experimental observations, validating our modelling approach. ARCTICA serves as a valuable pipeline for understanding cellular physiology and predicting rate-limiting steps in genome-scale metabolic networks, and thus provides guidance for bioengineering of cyanobacteria.
Published: 2024
Full Text: View/download PDF

29. Properties of the full random-effect modeling approach with missing covariate data

Author: Nyberg, Joakim, Jonsson, E. Niclas, Karlsson, Mats O., and Häggström, Jonas
Abstract: During drug development, a key step is the identification of relevant covariates predicting between-subject variations in drug response. The full random effects model (FREM) is one of the full-covariate approaches used to identify relevant covariates in nonlinear mixed effects models. Here we explore the ability of FREM to handle missing (both missing completely at random (MCAR) and missing at random (MAR)) covariate data and compare it to the full fixed-effects model (FFEM) approach, applied either with complete case analysis or mean imputation. A global health dataset (20 421 children) was used to develop a FREM describing the changes of height for age Z-score (HAZ) over time. Simulated datasets (n = 1000) were generated with variable rates of missing (MCAR) covariate data (0%-90%) and different proportions of missing (MAR) data conditional on either observed covariates or predicted HAZ. The three methods were used to re-estimate the model and were compared in terms of bias and precision. FREM showed only minor increases in bias and minor loss of precision at increasing percentages of missing (MCAR) covariate data and performed similarly in the MAR scenarios. Conversely, the FFEM approaches either collapsed at ≥70% of missing (MCAR) covariate data (FFEM complete case analysis) or had large bias increases and loss of precision (FFEM with mean imputation). Our results suggest that FREM is an appropriate approach to covariate modeling for datasets with missing (MCAR and MAR) covariate data, such as in global health studies.
Published: 2024
Full Text: View/download PDF
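
Note: the two baseline strategies named in the abstract above (complete case analysis and mean imputation under MCAR missingness) can be illustrated with a minimal sketch. This is not the FREM or FFEM implementation from the paper; the toy linear data, the function names, and the missingness rates are assumptions chosen only for illustration. It shows one well-known side effect of mean imputation: the variance of the imputed covariate shrinks as the missing fraction grows.

```python
import numpy as np

def make_mcar(x, rate, rng):
    """Return a copy of x with entries set to NaN completely at random."""
    x = x.astype(float).copy()
    mask = rng.random(x.shape[0]) < rate
    x[mask] = np.nan
    return x

def complete_case(x, y):
    """Keep only the records whose covariate is observed."""
    keep = ~np.isnan(x)
    return x[keep], y[keep]

def mean_impute(x):
    """Replace missing covariate values with the mean of the observed ones."""
    filled = x.copy()
    filled[np.isnan(filled)] = np.nanmean(x)
    return filled

rng = np.random.default_rng(0)
n = 1000  # hypothetical sample size, unrelated to the study data
covariate = rng.normal(0.0, 1.0, n)
response = 0.5 * covariate + rng.normal(0.0, 1.0, n)

for rate in (0.1, 0.5, 0.9):
    x_miss = make_mcar(covariate, rate, rng)
    x_cc, y_cc = complete_case(x_miss, response)
    x_imp = mean_impute(x_miss)
    # Columns: missing rate, complete cases left, covariate variance (CC, imputed)
    print(rate, x_cc.size, round(np.var(x_cc), 3), round(np.var(x_imp), 3))
```

Complete case analysis loses records as the missing rate rises, while mean imputation keeps every record but compresses the covariate distribution.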

30. Automated Analysis of Nano-Impact Single-Entity Electrochemistry Signals Using Unsupervised Machine Learning and Template Matching

Author: Zhao, Ziwen, Naha, Arunava, Ganguli, Sagar, and Sekretareva, Alina
Abstract: Nano-impact (NIE) (also referred to as collision) single-entity electrochemistry is an emerging technique that enables electrochemical investigation of individual entities, ranging from metal nanoparticles to single cells and biomolecules. To obtain meaningful information from NIE experiments, analysis and feature extraction on large datasets are necessary. Herein, a method is developed for the automated analysis of NIE data based on unsupervised machine learning and template matching approaches. Template matching not only facilitates downstream processing of the NIE data but also provides a more accurate analysis of the NIE signal characteristics and variations that are difficult to discern with conventional data analysis techniques, such as the height threshold method. The developed algorithm enables fast automated processing of large experimental datasets recorded with different systems, requiring minimal human intervention and thereby eliminating human bias in data analysis. As a result, it improves the standardization of data processing and NIE signal interpretation across various experiments and applications. Nano-impact (NIE) electrochemistry is an emerging technique for studying individual entities. Analyzing large NIE datasets, often with low signal-to-noise ratios, is challenging. Herein, an automated approach is introduced using unsupervised machine learning and template matching for accurate feature extraction from spike-shaped NIE signals. It improves data processing, accuracy and standardization, reducing human bias in signal interpretation across experiments. (c) 2023 WILEY-VCH GmbH
Published: 2024
Full Text: View/download PDF

31. Achieving improved accuracy for imputation of ancient DNA

Author: Ausmees, Kristiina and Nettelblad, Carl
Abstract: Motivation: Genotype imputation has the potential to increase the amount of information that can be gained from the often limited biological material available in ancient samples. As many widely used tools have been developed with modern data in mind, their design is not necessarily reflective of the requirements in studies of ancient DNA. Here, we investigate if an imputation method based on the full probabilistic Li and Stephens model of haplotype frequencies might be beneficial for the particular challenges posed by ancient data. Results: We present an implementation called prophaser and compare imputation performance to two alternative pipelines that have been used in the ancient DNA community based on the Beagle software. Considering empirical ancient data downsampled to lower coverages as well as present-day samples with artificially thinned genotypes, we show that the proposed method is advantageous at lower coverages, where it yields improved accuracy and ability to capture rare variation. The software prophaser is optimized for running in a massively parallel manner and achieved reasonable runtimes on the experiments performed when executed on a GPU. eSSENCE - An eScience Collaboration
Published: 2023
Full Text: View/download PDF

32. Jack of all trades, master of none : the multifaceted nature of H3K36 methylation

Author: Lindehell, Henrik
Abstract: Post-translational modifications of histones enable differential transcriptional control of the genome between cell types and developmental stages, and in response to environmental factors. The methylation of Histone 3 Lysine 36 (H3K36) is one of the most complex and well-studied histone modifications and is known to be involved in a wide range of molecular processes. Commonly associated with active genes and transcriptional elongation, H3K36 methylation also plays a key role in DNA repair, repression of cryptic transcription, and guiding additional post-translational modifications to histones, genomic DNA, and RNA. In Drosophila melanogaster, trimethylated H3K36 has also been linked to dosage compensation of the single male X chromosome as a binding substrate for the Male-Specific Lethal (MSL) complex. However, this model has been challenged by structural and biochemical studies demonstrating higher MSL complex affinity for other methylated lysines. There is an additional system of chromosome-specific gene regulation in D. melanogaster where transcription from the small heterochromatic fourth chromosome is increased by Painting of fourth (POF), a protein specifically binding nascent RNA on the fourth chromosome. The fourth chromosome is thought to have been an ancestral X chromosome that reverted into an autosome. POF mediating high transcription levels from an autosome is believed to be a remnant of an ancient sex-chromosome dosage compensation mechanism. Proximity ligation assays revealed no interaction between MSL complex components and methylated H3K36. This finding was corroborated by RNA sequencing of H3K36 methylation impaired mutants: the transcriptional output of the male X chromosome was unaffected in mutants where Lysine 36 on Histone 3 was replaced by an Arginine, abolishing methylation of this site.
However, we found that knocking out Set2, which encodes the methyltransferase responsible for H3K36 trimethylation, significantly reduced X-linked transcript
Published: 2023

33. Covariance properties under natural image transformations for the generalized Gaussian derivative model for visual receptive fields

Author: Lindeberg, Tony
Abstract: The property of covariance, also referred to as equivariance, means that an image operator is well-behaved under image transformations, in the sense that the result of applying the image operator to a transformed input image gives essentially a similar result as applying the same image transformation to the output of applying the image operator to the original image. This paper presents a theory of geometric covariance properties in vision, developed for a generalized Gaussian derivative model of receptive fields in the primary visual cortex and the lateral geniculate nucleus, which, in turn, enable geometric invariance properties at higher levels in the visual hierarchy. It is shown how the studied generalized Gaussian derivative model for visual receptive fields obeys true covariance properties under spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations. These covariance properties imply that a vision system, based on image and video measurements in terms of the receptive fields according to the generalized Gaussian derivative model, can, to first order of approximation, handle the image and video deformations between multiple views of objects delimited by smooth surfaces, as well as between multiple views of spatio-temporal events, under varying relative motions between the objects and events in the world and the observer. We conclude by describing implications of the presented theory for biological vision, regarding connections between the variabilities of the shapes of biological visual receptive fields and the variabilities of spatial and spatio-temporal image structures under natural image transformations. 
Specifically, we formulate experimentally testable biological hypotheses as well as needs for measuring population statistics of receptive field characteristics, originating from predictions from the presented theory, concerning the extent to which the shapes of the biological receptiv, QC 20230328, Covariant and invariant deep networks
Published: 2023

34. Orientation selectivity of affine Gaussian derivative based receptive fields

Author: Lindeberg, Tony
Abstract: This paper presents a theoretical analysis of the orientation selectivity of simple and complex cells that can be well modelled by the generalized Gaussian derivative model for visual receptive fields, with the purely spatial component of the receptive fields determined by oriented affine Gaussian derivatives for different orders of spatial differentiation. A detailed mathematical analysis is presented for the three different cases of either: (i) purely spatial receptive fields, (ii) space-time separable spatio-temporal receptive fields and (iii) velocity-adapted spatio-temporal receptive fields. Closed-form theoretical expressions for the orientation selectivity curves for idealized models of simple and complex cells are derived for all these main cases, and it is shown that the degree of orientation selectivity of the receptive fields increases with a scale parameter ratio $\kappa$, defined as the ratio between the scale parameters in the directions perpendicular to vs. parallel with the preferred orientation of the receptive field. It is also shown that the degree of orientation selectivity increases with the order of spatial differentiation in the underlying affine Gaussian derivative operators over the spatial domain. We conclude by describing biological implications of the derived theoretical results, demonstrating that the predictions from the presented theory are consistent with previously established biological results concerning broad vs. sharp orientation tuning of visual neurons in the primary visual cortex. We also demonstrate that the above theoretical predictions, in combination with these biological results, are consistent with a previously formulated biological hypothesis, stating that the biological receptive field shapes should span the degrees of freedom in affine image transformations, to support affine covariance over the population of receptive fields in the primary visual cortex., QC 20230425, Covariant and invariant deep networks
Published: 2023

35. Improved computations for relationship inference using low-coverage sequencing data

Author: Mostad, Petter, Tillmar, Andreas, and Kling, Daniel
Abstract: Pedigree inference, for example determining whether two persons are second cousins or unrelated, can be done by comparing their genotypes at a selection of genetic markers. When the data for one or more of the persons is from low-coverage next generation sequencing (lcNGS), currently available computational methods either ignore genetic linkage or do not take advantage of the probabilistic nature of lcNGS data, relying instead on first estimating the genotype. We provide a method and software (see familias.name/lcNGS) bridging the above gap. Simulations indicate that our results are considerably more accurate than those of some previously available alternatives. Our method, utilizing a version of the Lander-Green algorithm, uses a group of symmetries to speed up calculations. This group may be of further interest in other calculations involving linked loci. Funding Agencies: Chalmers University of Technology
Published: 2023
Full Text: View/download PDF

36. KIF-Key Interactions Finder : A program to identify the key molecular interactions that regulate protein conformational changes

Author: Crean, Rory M., Slusky, Joanna S. G., Kasson, Peter M., and Kamerlin, Shina C. Lynn
Abstract: Simulation datasets of proteins (e.g., those generated by molecular dynamics simulations) are filled with information about how a non-covalent interaction network within a protein regulates the conformation and, thus, function of the said protein. Most proteins contain thousands of non-covalent interactions, with most of these being largely irrelevant to any single conformational change. The ability to automatically process any protein simulation dataset to identify non-covalent interactions that are strongly associated with a single, defined conformational change would be a highly valuable tool for the community. Furthermore, the insights generated from this tool could be applied to basic research, in order to improve understanding of a mechanism of action, or for protein engineering, to identify candidate mutations to improve/alter the functionality of any given protein. The open-source Python package Key Interactions Finder (KIF) enables users to identify those non-covalent interactions that are strongly associated with any conformational change of interest for any protein simulated. KIF gives the user full control to define the conformational change of interest as either a continuous variable or categorical variable, and methods from statistics or machine learning can be applied to identify and rank the interactions and residues distributed throughout the protein, which are relevant to the conformational change. Finally, KIF has been applied to three diverse model systems (protein tyrosine phosphatase 1B, the PDZ3 domain, and the KE07 series of Kemp eliminases) in order to illustrate its power to identify key features that regulate functionally important conformational dynamics.
Published: 2023
Full Text: View/download PDF

37. Discovery of Chemical Probes through Structure-based Virtual Screening of Vast Compound Databases

Author: Luttens, Andreas
Abstract: Bioactive molecules have traditionally been discovered through labor-intensive screening methods in which individual compounds are tested against specific protein targets or cells to identify those that produce the desired biological effect. However, these approaches have significant limitations. Firstly, the number of molecules that can be tested in a standard laboratory is restricted, and the acquisition and curation of these compounds come at a high cost. Secondly, these methods are time-consuming because each compound must be tested individually, and they are confined to small libraries with very limited chemical space coverage. In contrast, structure-based virtual screening can rapidly predict a molecule's interaction with a target protein, allowing for the evaluation of enormous libraries of chemical substances. Furthermore, this approach is not restricted to physically available molecules and can be extended to virtual compounds. Commercial chemical space has recently grown exponentially and currently contains several billion molecules that can be readily synthesized and delivered for experimental testing within weeks. Despite the enormous potential of these databases for drug discovery, they also pose new challenges, and development of effective strategies is required to explore ultralarge libraries. The goal of this thesis was to develop and apply novel strategies focused on exploring the potential of ultralarge chemical libraries using structure-based virtual screening. Publication I summarizes best practices on large-scale virtual screening and benchmarking protocols for molecular docking calculations. Publication II describes a docking screen of several hundred million lead-like molecules against the SARS-CoV-2 main protease, leading to promising starting points for development of coronavirus inhibitors. The binding modes predicted by docking were confirmed experimentally by X-ray crystallography. 
After several rounds of optimization, nanomolar broad-spe
Published: 2023

38. The revolutionary partnership of computation and biology

Author: Rivas-Carrillo, Salvador Daniel
Abstract: The organization of living beings is complex. Science uses modeling in order to gain a deeper understanding, and to be able to manipulate the processes of living organisms. To this purpose, I used and developed computational tools to investigate and model different relevant biological phenomena. In paper I, I utilized whole-genome data from wild and domesticated European rabbit (Oryctolagus cuniculus sp.) populations to identify segregating insertions of endogenous retroviruses and compare their variation along the host phylogeny and domestication history. The results from this study highlight the importance of genomic modeling beyond reference organisms and reference individuals, and provide deep insights regarding strategies for variant analyses in host population comparative genomics. In paper IV, I studied the process of exaptation of foreign genetic elements at broad-scale by observing the presence and characteristics of retroviral env gene, syncytin, across vertebrates. I searched a library of more than 150 chromosome-length assemblies covering 17 taxonomical orders for syncytin homologs, where I identified and syntenically aligned over 300 loci insertions, including not previously known insertions. Additionally, three-dimensional structures of the recovered sequences were predicted using AlphaFold2. Phylogenomics analyses suggest a complex dynamic of multiple retroviral insertions at different time points with sequence conservation specific to clades that share a similar histo-physiological placental type. In paper II, I expanded the scope to encompass translational medicine by developing an unsupervised machine learning methodology for detecting anomalies in biomedical signals, MindReader, which I applied primarily to electroencephalogram. In paper III, I developed a hidden Markov model implementation that includes a hypothesis generator for stream time-domain signals, which is used as a dependency for paper II. 
The work in this thesis substantiates that a
Published: 2023

39. Learning-based prediction, representation, and multimodal registration for bioimage processing

Author: Pielawski, Nicolas
Abstract: Microscopy and imaging are essential to understanding and exploring biology. Modern staining and imaging techniques generate large amounts of data resulting in the need for automated analysis approaches. Many earlier approaches relied on handcrafted feature extractors, while today's deep-learning-based methods open up new ways to analyze data automatically. Deep learning has become popular in bioimage processing as it can extract high-level features describing image content (Paper III). The work in this thesis explores various aspects and limitations of machine learning and deep learning with applications in biology. Learning-based methods have generalization issues on out-of-distribution data points, and methods such as uncertainty estimation (Paper II) and visual quality control (Paper V) can provide ways to mitigate those issues. Furthermore, deep learning methods often require large amounts of data during training. Here the focus is on optimizing deep learning methods to meet current computational capabilities and handle the increasing volume and size of data (Paper I). Model uncertainty and data augmentation techniques are also explored (Papers II and III). This thesis is split into chapters describing the main components of cell biology, microscopy imaging, and the mathematical and machine-learning theories to give readers an introduction to biomedical image processing. The main contributions of this thesis are deep-learning methods for reconstructing patch-based segmentation (Paper I) and pixel regression of traction force images (Paper II), followed by methods for aligning images from different sensors in a common coordinate system (named multimodal image registration) using representation learning (Paper III) and Bayesian optimization (Paper IV). Finally, the thesis introduces TissUUmaps 3, a tool for visualizing multiplexed spatial transcriptomics data (Paper V). 
These contributions provide methods and tools detailing how to apply mathematical frameworks a
Published: 2023

40. A time-causal and time-recursive scale-covariant scale-space representation of temporal signals and past time

Author: Lindeberg, Tony
Abstract: This article presents an overview of a theory for performing temporal smoothing on temporal signals in such a way that: (i) temporally smoothed signals at coarser temporal scales are guaranteed to constitute simplifications of corresponding temporally smoothed signals at any finer temporal scale (including the original signal) and (ii) the temporal smoothing process is both time-causal and time-recursive, in the sense that it does not require access to future information and can be performed with no other temporal memory buffer of the past than the resulting smoothed temporal scale-space representations themselves. For specific subsets of parameter settings for the classes of linear and shift-invariant temporal smoothing operators that obey this property, it is shown how temporal scale covariance can be additionally obtained, guaranteeing that if the temporal input signal is rescaled by a uniform temporal scaling factor, then also the resulting temporal scale-space representations of the rescaled temporal signal will constitute mere rescalings of the temporal scale-space representations of the original input signal, complemented by a shift along the temporal scale dimension. The resulting time-causal limit kernel that obeys this property constitutes a canonical temporal kernel for processing temporal signals in real-time scenarios when the regular Gaussian kernel cannot be used, because of its non-causal access to information from the future, and we cannot additionally require the temporal smoothing process to comprise a complementary memory of the past beyond the information contained in the temporal smoothing process itself, which in this way also serves as a multi-scale temporal memory of the past. We describe how the time-causal limit kernel relates to previously used temporal models, such as Koenderink's scale-time kernels and the ex-Gaussian kernel. 
We also give an overview of how the time-causal limit kernel can be used for modelling the temporal processing, Not duplicate with DiVA which is a report. QC 20221202, Scale-space theory for covariant and invariant visual perception
Published: 2023
Full Text: View/download PDF
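
Note: the two properties stressed in the abstract above, time-causality (no access to future samples) and time-recursivity (no memory beyond the smoothed values themselves), can be illustrated with a generic cascade of first-order recursive filters. This is only a sketch under stated assumptions: the function names and the geometrically spaced `mus` values are hypothetical, and the code does not reproduce the parameterization of the time-causal limit kernel from the article.

```python
import numpy as np

def recursive_smooth(signal, mu):
    """First-order recursive filter.

    Each output depends only on the current input sample and the previous
    output, so the only memory of the past is the smoothed value itself.
    """
    out = np.empty(len(signal), dtype=float)
    prev = 0.0
    for t, x in enumerate(signal):
        prev = prev + (x - prev) / (1.0 + mu)
        out[t] = prev
    return out

def cascade(signal, mus):
    """Apply several first-order filters in sequence (coarser smoothing)."""
    result = np.asarray(signal, dtype=float)
    for mu in mus:
        result = recursive_smooth(result, mu)
    return result

# Illustrative time constants with a geometric ratio of 2 (an assumption).
mus = [1.0, 2.0, 4.0]
rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0.0, 6.0, 200)) + 0.3 * rng.normal(size=200)
smoothed = cascade(noisy, mus)
print(smoothed[:5])
```

Because every stage uses only the current sample and its own previous output, changing future input samples cannot alter outputs that have already been produced.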

41. Inter and intra-tumor models of somatic evolution in cancer

Author: Mohaghegh Neyshabouri, Mohammadreza
Abstract: Cancer is a disease caused by the accumulation of somatic mutations in an evolutionary process. Mutations in so-called cancer driver genes provide the harboring cells with particular selective advantages and result in cancer progression. Identification of the driver genes and their interrelations is critical for a wide range of research and clinical applications. This thesis investigates the problem of modeling the cancer evolution dynamics using probabilistic cancer progression models. Such models aim to explain the mechanism of accumulation of mutations in the tumor cells and how specific mutations may exert promoting or inhibiting effects on each other. We introduce a set of computational methods to analyze cross-sectional data from a cohort of tumors and infer the interrelations among cancer driver genes, represented by a graphical structure over them. In our first two papers, following the typical setting in the cancer progression model studies, we use a simple representation for the tumors in which a single genotype vector models each tumor. We introduce a pathway linear progression model in the first paper and a generalized tree-structured model in the second. Using novel dynamic programming procedures for calculating the likelihoods, we build Markov Chain Monte Carlo (MCMC) inference algorithms for our models in these papers. Using these fast and efficient MCMC algorithms enables us to study massive datasets that were infeasible to be investigated by previously introduced methods. In our third paper, we introduce a framework for taking a finer representation of the tumors into account for inferring progression models. With the rapid improvements in the amount and quality of available data, we can now work with vast numbers of reliably reconstructed tumor clonal trees. 
In our third paper, we introduce a method that takes such clonal trees from cohorts of tumors as its input and identifies the interrelations among the driver genes within a single tumor or acro, Cancer is a disease caused by the accumulation of somatic mutations in an evolutionary process. Mutations in so-called driver genes give cells particular selective advantages and result in cancer progression. Identifying the driver genes and their interrelations is crucial for a wide range of research and clinical applications. This thesis investigates the problem of modeling the dynamics of cancer evolution using probabilistic progression models. Such models aim to explain the mechanism of accumulation of mutations in tumor cells and how specific mutations can exert promoting or inhibiting effects on each other. We introduce a set of computational methods to analyze data from a cohort of tumors and infer the interrelations among driver genes, represented by a graphical structure over them. In our first two papers, following the typical setting in cancer progression model studies, we use a simple representation of the tumors in which a single genotype vector models each tumor. We introduce a linear progression model in the first paper and a generalized tree-structured model in the second. Using novel dynamic programming procedures for calculating the likelihoods, we develop Markov Chain Monte Carlo (MCMC) inference algorithms for our models in these papers. Using these fast and efficient MCMC algorithms, we can study massive datasets that were impossible to investigate with previously introduced methods. In our third paper, we introduce a framework for taking a finer representation of the tumors into account when inferring progression models. 
With the rapid improvements in the amount and quality of available data, we can now work with a large number of reliably reconstructed tumor clonal trees. In our third paper, we introduce a method that takes such clonal trees from cohorts of tumors as its input and identifies the interrelations among the driver genes, QC 20230131
Published: 2023

42. Early Prediction of Dementia Using Feature Extraction Battery (FEB) and Optimized Support Vector Machine (SVM) for Classification

Author: Javeed, Ashir, Dallora, Ana Luiza, Sanmartin Berglund, Johan, Idrisoglu, Alper, Ali, Liaqat, Rauf, Hafiz Tayyab, and Anderberg, Peter
Abstract: Dementia is a cognitive disorder that mainly targets older adults. At present, dementia has no cure or prevention available. Scientists found that dementia symptoms might emerge as early as ten years before the onset of real disease. As a result, machine learning (ML) scientists developed various techniques for the early prediction of dementia using dementia symptoms. However, these methods have fundamental limitations, such as low accuracy and bias in ML models. To resolve the issue of bias in the proposed ML model, we deployed the adaptive synthetic sampling (ADASYN) technique, and to improve accuracy, we have proposed novel feature extraction techniques, namely, feature extraction battery (FEB) and optimized support vector machine (SVM) using a radial basis function (RBF) kernel for the classification of the disease. The hyperparameters of SVM are calibrated by employing the grid search approach. It is evident from the experimental results that the newly proposed model (FEB-SVM) improves the dementia prediction accuracy of the conventional SVM by 6%. The proposed model (FEB-SVM) obtained 98.28% accuracy on training data and a testing accuracy of 93.92%. Along with accuracy, the proposed model obtained a precision of 91.80%, recall of 86.59%, F1-score of 89.12%, and Matthews correlation coefficient (MCC) of 0.4987. Moreover, the newly proposed model (FEB-SVM) outperforms the 12 state-of-the-art ML models that the researchers have recently presented for dementia prediction. CC BY 4.0. This research received no external funding. Correspondence: peter.anderberg@bth.se
Published: 2023
Full Text: View/download PDF
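
Note: the F1-score reported in the abstract above can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short sketch below assumes that the recall figure of 86.59 is a percentage like the other metrics; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    return 2 * precision * recall / (precision + recall)

# Values taken from the abstract, expressed as fractions.
reported_precision = 0.9180
reported_recall = 0.8659  # assumed to be a percentage in the abstract

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.8912"
```

The result matches the reported F1-score of 89.12%, so the three figures are mutually consistent.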

43. Predicting and classifying atrial fibrillation from ECG recordings using machine learning

Author: Bogstedt, Carl
Abstract: Atrial fibrillation is one of the most common types of heart arrhythmias, which can cause irregular, weak and fast atrial contractions up to 600 beats per minute. Atrial fibrillation has increased prevalence with age and is associated with increased risks of ischemia, as blood clots can form due to the weak contractions. During prolonged periods of atrial fibrillation, the atria can undergo a process called atrial remodelling. This causes electrophysiological and structural changes to the atria such as increased atrial size and changes to calcium ion densities. These changes themselves promote the initiation and propagation of atrial fibrillation, which makes early detection crucial. Fortunately, atrial fibrillation can be detected on an electrocardiogram. Electrocardiograms measure the electrical activity of the heart during its cardiac cycle. This includes the initiation of the action potential, the depolarization of the atria and ventricles and their repolarization. On the electrocardiogram recording, these are seen as peaks and valleys, where each peak and valley can be traced back to one of these events. This means that during atrial fibrillation, the weak, irregular and fast atrial contractions can all be detected and measured. The aim of this project was to develop a machine learning model that could predict onset of atrial fibrillation, and that could classify ongoing atrial fibrillation. This was achieved by training one multiclass classification machine learning model using XGBoost, and three binary classification machine learning models using ROSETTA, on electrocardiogram recordings of people with and without atrial fibrillation. XGBoost is a tree boosting system which uses tree-like structures to classify data, while ROSETTA is a rule-based classification model which creates rules in an IF and THEN format to make decisions. 
The recordings were labelled according to three different classes: no atrial fibrillation, atrial fibrillation or preceding atrial
Published: 2023

44. Developing a reproducible bioinformatics workflow for canine inherited retinal disease

Author: Martin, Melina Toni Marie
Abstract: Inherited Retinal Degenerations (IRDs) are a heterogeneous group of diseases which lead to vision impairment and can be found both in humans and in dogs. About 1 in 1,380 humans is estimated to suffer from an autosomal recessive IRD, which would be 5.5 million people worldwide, and many more are estimated to be unaffected carriers. This makes autosomal recessive IRDs likely the most common group of Mendelian diseases in humans. Today, about 300 genetic mutations have been linked to retinal diseases in humans. Whilst in dogs only 32 genes have been identified, numerous eye conditions have been described where the genetic cause has not yet been identified. This suggests that there are many more genetic causes to discover in the dog genome. Additionally, the dog serves well as a model organism to investigate IRDs as it shares morphological and genetic similarities with humans. For these reasons, proper software, a canine reference genome of high quality, and smart implementation of bioinformatic tools and methods are a big advantage to increase chances of finding new causative genetic variants and subsequently enable faster detection of possible preventions of the disease or at least alleviating its symptoms via early diagnosis. In this project, a pre-existing pipeline consisting of Bash scripts was stepwise improved with the goal to increase its efficiency. After checking in a first step whether previous data could still be reproduced with the old pipeline, the software was replaced with more recent versions in a second step. A main change was the replacement of the mapping tool Burrows-Wheeler Aligner (BWA) from bwa mem to bwa-mem2 mem, and the update of deprecated Genome Analysis Toolkit (GATK) 3.7 to version 4.3 or 4.4. Thirdly, the scripts were adapted from using the older canine reference genome CanFam3.1 to CanFam4. 
In a fourth step, to automate the pipeline and shorten its running time, the pipeline steps were implemented into the workflow management
Published: 2023

45. Developing Automated Cell Segmentation Models Intended for MERFISH Analysis of the Cardiac Tissue by Deploying Supervised Machine Learning Algorithms

Author: Rune, Julia
Abstract: This study addresses the development of automated cell segmentation models intended to identify boundaries between cells in cardiac tissue. The aim is to enable analysis of data generated by multiplexed error-robust fluorescence in situ hybridization (MERFISH). MERFISH is a spatial transcriptomics technique that, unlike for example single-cell RNA sequencing (scRNA-seq) and single molecule fluorescence in situ hybridization (smFISH), enables profiling of hundreds of RNA sequences in individual cells without losing their spatial context. In the Kosuri Lab at the Salk Institute for Biological Studies in San Diego, MERFISH is applied to mouse hearts. The aim is to gain deeper insight into how cells are organized in healthy hearts and how this structure changes with aging and disease. Extracting meaningful information from MERFISH, however, poses a significant challenge: accurate cell segmentation. The study therefore contributes to the development of segmentation models that overcome the obstacles standing in the way of all downstream analysis. Because classical segmentation algorithms are insufficient for the complex tissue of the heart, two of today's most advanced and prominent machine learning algorithms in the field, Cellpose and Omnipose, were applied. Given the dense and heterogeneous cardiac tissue, which stems from a broad distribution of cell types and geometries, two separate models were developed: one covering both smaller cells and cross-sectioned cardiomyocytes, and one segmenting only cardiomyocytes cut in the longitudinal direction. The former model was developed and trained in Cellpose and achieved an accuracy of 91.2%. The model for longitudinal cardiomyocytes was instead developed in both Cellpose and Omnipose to evaluate which network is better suited to the purpose. Neither network achieved an accuracy high enough to be applicable, and both are therefore in need of
The following study delves into the development of automated cell segmentation models, with the intention of identifying boundaries between cells in cardiac tissue for analysing spatial transcriptomics data. Addressing the limitations of alternative techniques such as single-cell RNA sequencing (scRNA-seq) and single molecule fluorescence in situ hybridization (smFISH), the study underscores the innovative use of multiplexed error-robust fluorescence in situ hybridization (MERFISH) deployed by the Kosuri Lab at the Salk Institute for Biological Studies. This advanced imaging-based technique allows single-cell transcriptome profiling of hundreds of different transcripts while retaining the spatial context of the tissue. The technique can accordingly reveal how the organization of cells within a healthy heart is altered during disease. However, the extraction of meaningful data from MERFISH poses a significant challenge: accurate cell segmentation. This thesis therefore presents the development of a robust model for cell boundary identification within cardiac tissue, leveraging two advanced supervised machine learning algorithms in the field, Cellpose and Omnipose. Because the tissue is dense and highly heterogeneous, stemming from a wide distribution of cell types and shapes, two separate models had to be developed: one that covers the smaller cells and the cross-sectioned cardiomyocytes, and one that covers the longitudinal cardiomyocytes. The cross-section model achieved an accuracy of 91.2%, whereas the longitudinal model needs further improvement before it can be put to use. 
The thesis acknowledges potential areas for improvement, emphasizing the need to further improve the segmentation of longitudinal cardiomyocytes, tackle the challenges of segmenting cells within fibrotic regions of the diseased heart, and achieve precise 3D cell segmentation. Nonetheless, the generated models have paved
Published: 2023

46. Predicting morphological effect of compounds on COVID-19 infected cells

Author: Öhrner, Viktor
Abstract: The cost of developing new drugs is high, and the aim of computer-assisted drug discovery is to reduce that cost, either through virtual screening or by generating novel compounds. Systems biology is one approach to drug discovery in which the response of a biological system, rather than the drug-target interaction, is the subject of study. One way to observe a biological system is through microscopy images of cells perturbed with compounds. Image analysis software extracts information called morphological profiles from the images, which can be used to train data-hungry models. One way artificial intelligence has been applied to drug discovery is through generative models that propose new compounds. One such approach is reinforcement learning, which employs a critic to guide the generation of compounds towards desirable behaviors. In this study, different machine learning models were tested to see whether they could predict the morphological response of COVID-19 infected cells to compounds from the compounds' structure. No model showed promising results, which is attributed to the dataset. The dataset has high variance, meaning that the response to the same compound varies. The compounds in the dataset also differ greatly from one another, meaning that any representation the model learns does not transfer to other compounds. Finally, the dataset was imbalanced, with more inactive than active compounds.
Published: 2023

47. Intersecting Graph Representation Learning and Cell Profiling : A Novel Approach to Analyzing Complex Biomedical Data

Author: Chamyani, Nima
Abstract: In recent biomedical research, graph representation learning and cell profiling techniques have emerged as transformative tools for analyzing high-dimensional biological data. The integration of these methods, as investigated in this study, has facilitated an enhanced understanding of complex biological systems, consequently improving drug discovery. The research aimed to decipher connections between chemical structures and cellular phenotypes while incorporating other biological information, such as proteins and pathways, into the workflow. To achieve this, the efficacy of machine learning models was examined for classification and regression tasks. The newly proposed graph-level and bio-graph integrative predictors were compared with traditional models. Results demonstrated their potential, particularly in classification tasks. Moreover, the topology of the COVID-19 BioGraph was analyzed, revealing the complex interconnections between chemicals, proteins, and biological pathways. By combining network analysis, graph representation learning, and statistical methods, the study was able to predict active chemical combinations among inactive compounds, thereby exhibiting significant potential for further investigation. Graph-based generative models were also used for molecule generation, opening up further research avenues in finding lead compounds. In conclusion, this study underlines the potential of combining graph representation learning and cell profiling techniques to advance biomedical research in drug repurposing and drug combination. This integration provides a better understanding of complex biological systems, assists in identifying therapeutic targets, and contributes to optimizing molecule generation for drug discovery. Future investigations should optimize these models and validate the drug combination discovery approach. 
As these techniques continue to evolve, they hold the potential to significantly impact the future of drug screening, drug repurposing, and dr
Published: 2023

48. Panacea: Predicting anti-aging combinations from expression analysis

Author: Jatti, Ashwini
Abstract: Identifying interventions, such as drugs, that can counteract the effects of aging is crucial because the aging process is complex and involves multiple biological processes. By targeting these processes, interventions have the potential to promote healthy aging. Pairs of drugs that exhibit synergistic effects are particularly effective because they can simultaneously affect multiple pathways associated with aging and reprogramming, enhancing their anti-aging potential. The Panacea (predicting anti-aging combinations from expression analysis) framework was developed to facilitate the discovery of such drug combinations. Deep generative models were incorporated into the Panacea framework to capture complex patterns in gene expression data, leveraging their non-linear nature to represent relationships and interactions accurately. This makes them well suited to predicting drug combinations. The models, trained on the CMap dataset, demonstrated improved performance in predicting the effects of drugs. The age effect of these drug combinations was evaluated using an age-predictive model, revealing that synergistic anti-aging combinations mainly comprised reprogramming (the process of transforming one type of cell into another by altering its gene expression and properties), apoptosis (programmed cell death mechanism), and chemotherapy drugs, while pro-aging combinations involved cellular growth-limiting, longevity-extending, and chemotherapy drugs. These results emphasize the capability of deep generative models in predicting potent drug combinations for anti-aging and anti-cancer interventions.
Published: 2023

49. Predicting tumour growth-driving interactions from transcriptomic data using machine learning

Author: Stigenberg, Mathilda
Abstract: The mortality rate is high for cancer patients, and treatments are effective in only a fraction of patients. To cure more patients, new treatments need to be developed. Immunotherapy activates the immune system to fight cancer, and one form of treatment targets immune checkpoints. If more targets are found, more patients can be treated successfully. In this project, interactions between immune and cancer cells that drive tumour growth were investigated in an attempt to find new potential targets. This was achieved by creating a machine learning model that finds genes expressed in cells involved in tumour-driving interactions. Single-cell RNA sequencing and spatial transcriptomic data from breast cancer patients were utilised, as well as single-cell RNA sequencing data from healthy individuals. The tumour growth rate was based on the cumulative expression of G2/M genes. The G2/M-related genes were excluded from the analysis since these were assumed to be cell cycle genes. The machine learning model was based on a supervised variational autoencoder architecture. With this kind of architecture, it was possible to compress the input into a low-dimensional space of genes, called a latent space, which was able to explain the tumour growth rate. The Optuna hyperparameter optimization framework was utilised to find the best combination of hyperparameters for the model. The model had an R2 score of 0.93, indicating that the latent space explained 93% of the variance in the growth rate. The latent space consisted of 20 variables. To find out which genes were represented in this latent space, the correlation between each latent variable and each gene was calculated. The genes that were positively or negatively correlated were assumed to be in the latent space and therefore involved in explaining tumour growth. Furthermore, the correlation between each latent variable and the growth rate was calculated. 
The up- and downregulated genes in each latent variable were kept and used for
Published: 2023
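
The two statistics named in this abstract, the R2 score and the correlation between a latent variable and each gene, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the gene names, the toy values, the helper names and the 0.8 cut-off are all hypothetical, and the thesis's supervised variational autoencoder is not reproduced.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot,
    i.e. the fraction of variance in y_true explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def genes_linked_to_latent(latent, expression, threshold=0.8):
    """Return {gene: r} for genes whose absolute correlation with the
    latent variable reaches the (hypothetical) threshold."""
    linked = {}
    for gene, values in expression.items():
        r = pearson(latent, values)
        if abs(r) >= threshold:
            linked[gene] = r
    return linked


# Toy data, made up for illustration: one latent variable over five cells.
latent = [0.1, 0.4, 0.5, 0.8, 1.0]
expression = {
    "GENE_A": [1.0, 2.1, 2.4, 3.9, 5.0],  # rises with the latent variable
    "GENE_B": [5.0, 4.2, 3.8, 2.1, 1.0],  # falls with the latent variable
    "GENE_C": [2.0, 1.0, 3.0, 1.0, 2.0],  # no clear trend
}
growth_rate = [0.2, 0.5, 0.6, 0.9, 1.1]       # observed (toy) growth rates
predicted = [0.25, 0.45, 0.65, 0.85, 1.05]    # model output (toy)

linked = genes_linked_to_latent(latent, expression)
score = r_squared(growth_rate, predicted)
print(sorted(linked), round(score, 3))
```

In this sketch both positively and negatively correlated genes are kept, mirroring the abstract's statement that genes correlated in either direction were assumed to be in the latent space. The R2 value is a proportion of explained variance, not a percentage of correct predictions.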

50. Computational prediction of cell-cell interactions in the brain-tumour microenvironment

Author: Camargo Romera, Paula
Abstract: Glioblastoma is the fastest-growing and most common malignant brain tumour in adults. It is normally treated with surgery and radio- or chemotherapy, but life expectancy is approximately 15 months, with a high probability of recurrence. Therefore, there is a need to reduce its severity. Bulk and single-cell RNA sequencing allow the identification of cellular states in tumours affected by cell-intrinsic and extrinsic factors. Four different cellular states have been identified in glioblastoma: neural progenitor-like, oligodendrocyte progenitor-like, astrocyte-like, and mesenchymal-like. As glioblastoma is an immunosuppressive tumour, it can alter the immune system and increase the tumour's immune escape by secreting immunosuppressive factors or interacting with the brain microenvironment. Two datasets were used in this study to explore whether the localization of the tumour in the brain microenvironment and the tendency of glioblastomas to activate microglial cells are due to particular ligand-receptor interactions. Data quality control was applied to both datasets, and the SingleCellSignalR and CellPhoneDB packages were used to predict the possible interactions. A total of seven experiments were designed for this study. The first dataset, GBmap, allowed comparisons between tumour cells and microglia, between tumour cells and other cell types in the brain, and between the four cellular states of glioblastoma and microglia and macrophages. Next, healthy microglia from GBmap were compared with the tumour bulk data from the second dataset, HGCC. The bootstrap technique was used to compare bulk data with single-cell data, and a comparison between tumour cells and microglia or other cell types was analysed. Results showed specific and shared interactions between cell types or cellular states, revealing that the different localization of the tumour cells depends on the expressed ligand-receptor pairs. 
Also, a total of four patterns of interactions were found in
Published: 2023
