Author: "Goh WWB" / Search Limiters: Academic (Peer-Reviewed) Journals - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Goh WWB"' showing total 89 results

1. Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.

Author: Peng, H, Wang, H, Kong, W, Li, J, and Goh, WWB
Abstract: Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows for expanding differential proteome coverage and resolving inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
Published: 2024
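
The abstract above reports F1 scores, Matthews correlation coefficients (MCC), and G-mean. As a minimal sketch of how these metrics are commonly computed from a binary confusion matrix, the following may help; the function names are hypothetical, and G-mean is assumed here to be the geometric mean of sensitivity and specificity, which may differ from the exact definition used in the paper.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy example: 1 marks a truly differentially expressed protein.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, f1_score(*counts), mcc(*counts), g_mean(*counts))
```

Unlike plain accuracy, MCC and G-mean both penalize a workflow that does well on only one of the two classes, which is one plausible reason to report them on spike-in benchmarks where true positives are a minority.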

2. ProInfer: An interpretable protein inference tool leveraging on biological networks.

Author: Elofsson, A, Peng, H, Wong, L, and Goh, WWB
Abstract: In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (a peptide which maps to more than one candidate protein) and enhances the statistical confidence for proteins supported by both unique and ambiguous peptides. Consequently, ProInfer rescues weakly supported proteins, thereby improving proteome coverage. Evaluated across THP1 cell line, lung cancer and RAW267.4 datasets, ProInfer always infers the largest number of true positives in comparison to mainstream protein inference tools Fido, EPIFANY and PIA. ProInfer is also adept at retrieving differentially expressed proteins, signifying its usefulness for functional analysis and phenotype profiling. Source code of ProInfer is available at https://github.com/PennHui2016/ProInfer.
Published: 2023

3. ProJect: a powerful mixed-model missing value imputation method.

Author: Kong, W, Wong, BJH, Hui, HWH, Lim, KP, Wang, Y, Wong, L, and Goh, WWB
Abstract: Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, and bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, scoring 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the lowest Procrustes sum of squared error (Procrustes SS) (exhibiting 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also leads with the highest correlation coefficient among all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle different types of MVs commonly found in real-world data. Unlike most MVI methods that are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines if an MV is missing at random or missing not at random. It then employs targeted imputation strategies for each MV type, resulting in more accurate and reliable imputation outcomes.
An R implementation of ProJect is availabl
Published: 2023

4. How missing value imputation is confounded with batch effects and what you can do about it.

Author: Goh, WWB, Hui, HWH, and Wong, L
Abstract: In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent. Batch sensitization can improve the quality of MVI. Conversely, accounting for missingness also improves proper BE estimation in BEC. Here, we discuss how BEC and MVI are interconnected and interdependent. We show how batch sensitization can improve any MVI and bring attention to the idea of BE-associated missing values (BEAMs). Finally, we discuss how batch-class imbalance problems can be mitigated by borrowing ideas from machine learning.
Published: 2023

5. Evaluating network-based missing protein prediction using p-values, Bayes Factors, and probabilities.

Author: Goh, WWB, Kong, W, and Wong, L
Abstract: Some prediction methods use probability to rank their predictions, while some other prediction methods do not rank their predictions and instead use p-values to support their predictions. This disparity renders direct cross-comparison of these two kinds of methods difficult. In particular, approaches such as the Bayes Factor upper Bound (BFB) for p-value conversion may not make correct assumptions for this kind of cross-comparisons. Here, using a well-established case study on renal cancer proteomics and in the context of missing protein prediction, we demonstrate how to compare these two kinds of prediction methods using two different strategies. The first strategy is based on false discovery rate (FDR) estimation, which does not make the same naïve assumptions as BFB conversions. The second strategy is a powerful approach which we colloquially call "home ground testing". Both strategies perform better than BFB conversions. Thus, we recommend comparing prediction methods by standardization to a common performance benchmark such as a global FDR. And where this is not possible, we recommend reciprocal "home ground testing".
Published: 2023

6. Resolving missing protein problems using functional class scoring

Author: Wong, BJH, Kong, W, Wong, L, and Goh, WWB
Subjects: Proteomics, Biomedical Research, Research Design, Proteins, Peptides
Abstract: Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in "data holes". These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples, hindering biomarker discovery and proper functional characterization. Network-based approaches can provide powerful solutions for resolving these issues. Functional Class Scoring (FCS) is one such method that uses protein complex information to recover missing proteins with weak support. However, FCS has not been evaluated on more recent proteomic technologies with higher coverage, and there is no clear way to evaluate its performance. To address these issues, we devised a more rigorous evaluation schema based on cross-verification between technical replicates and evaluated its performance on data acquired under recent Data-Independent Acquisition (DIA) technologies (viz. SWATH). Although cross-replicate examination reveals some inconsistencies amongst same-class samples, tissue-differentiating signal is nonetheless strongly conserved, confirming that FCS selects for biologically meaningful networks. We also report that predicted missing proteins are statistically significant based on FCS p values. Despite limited cross-replicate verification rates, the predicted missing proteins as a whole have higher peptide support than non-predicted proteins. FCS also predicts missing proteins that are often lost due to weak specific peptide support.
Published: 2022

7. Proteomic datasets of HeLa and SiHa cell lines acquired by DDA-PASEF and diaPASEF.

Author: Huang, Z, Kong, W, Wong, BJ, Gao, H, Guo, T, Liu, X, Du, X, Wong, L, and Goh, WWB
Abstract: We present four datasets on proteomics profiling of HeLa and SiHa cell lines associated with the research described in the paper "PROTREC: A probability-based approach for recovering missing proteins based on biological networks" [1]. Proteins in each cell line were acquired by two different data acquisition methods. The first was Data Dependent Acquisition-Parallel Accumulation Serial Fragmentation (DDA-PASEF) and the second was Parallel Accumulation-Serial Fragmentation combined with data-independent acquisition (diaPASEF) [2], [3]. Protein assembly was performed following search against the Swiss-Prot Human database using Peaks Studio for DDA datasets and Spectronaut for DIA datasets. The assembled result contains identified PSMs, peptides and proteins that are above threshold for each HeLa and SiHa sample. Coverage-wise, for DDA-PASEF, approximately 6,090 and 7,298 proteins were quantified for the HeLa and SiHa samples, while 13,339 and 8,773 proteins were quantified by diaPASEF for the HeLa and SiHa samples, respectively. Consistency-wise, diaPASEF has fewer missing values (∼2%) compared to its DDA counterparts (∼5-7%). The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the iProX partner repository [4] with the dataset identifier PXD029773.
Published: 2022

8. PROTREC: A probability-based approach for recovering missing proteins based on biological networks.

Author: Kong, W, Wong, BJH, Gao, H, Guo, T, Liu, X, Du, X, Wong, L, and Goh, WWB
Abstract: A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods - such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA) - across a variety of proteomics datasets derived from different proteomics data acquisition paradigms: Higher PROTREC scores are much more closely correlated with higher recovery rates of MPs across sample replicates. The PROTREC score, unlike methods reporting p-values, can be directly interpreted as the probability that an unreported protein in a proteomic screen is actually present in the sample being screened. SIGNIFICANCE: Mass spectrometry (MS) has developed rapidly in recent years; however, an obvious proportion of proteins is still undetected, leading to missing protein problems. A few existing protein recovery methods are based on biological networks, but the performance is not satisfactory. We propose a new protein recovery method, PROTREC, a Bayesian-inspired approach based on biological networks, which shows exceptional performance across multiple validation strategies. It does not rely on peptide information, so it avoids the ambiguity issue that most protein assembly methods face.
Published: 2022

9. How doppelgänger effects in biomedical data confound machine learning.

Author: Wang, LR, Wong, L, and Goh, WWB
Abstract: Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
Published: 2022
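
The recommendation above, to identify data doppelgängers before the training-validation split, can be illustrated with a small screening step. This is only a sketch under stated assumptions: it flags sample pairs whose pairwise Pearson correlation exceeds a cutoff, and both the function names and the 0.95 threshold are hypothetical choices for illustration, not values taken from the paper.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def flag_similar_pairs(samples, threshold=0.95):
    """Return pairs of sample names whose correlation exceeds the (hypothetical) threshold.

    `samples` maps a sample name to its feature vector.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(samples), 2)
        if pearson(samples[a], samples[b]) > threshold
    ]

# Toy data: s1 and s2 are near-duplicates, s3 is unrelated.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.1, 2.0, 3.1, 3.9],
    "s3": [4.0, 1.0, 3.0, 2.0],
}
print(flag_similar_pairs(samples))
```

Flagged pairs could then be kept on the same side of the split, so that a near-duplicate of a training sample does not end up inflating validation performance.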

10. Are batch effects still relevant in the age of big data?

Author: Goh, WWB, Yong, CH, and Wong, L
Abstract: Batch effects (BEs) are technical biases that may confound analysis of high-throughput biotechnological data. BEs are complex and effective mitigation is highly context-dependent. In particular, the advent of high-resolution technologies such as single-cell RNA sequencing presents new challenges. We first cover how BE modeling differs between traditional datasets and the new data landscape. We also discuss new approaches for measuring and mitigating BEs, including whether a BE is significant enough to warrant correction. Even with the advent of machine learning and artificial intelligence, the increased complexity of next-generation biotechnological data means increased complexities in BE management. We forecast that BEs will not only remain relevant in the age of big data but will become even more important.
Published: 2022

11. Avoid Oversimplifications in Machine Learning: Going beyond the Class-Prediction Accuracy

Author: Ho SY, Wong L, and Goh WWB
Abstract: Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or whether the selected or constructed features used are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning have been accomplished: two classifiers providing the same performance in one validation can disagree on many future validations. It does not provide explainability in its decision-making process and is not objective, as its value is also affected by class proportions in the validation set. Despite these issues, this does not mean we should omit the class-prediction accuracy. Instead, it needs to be enriched with accompanying evidence and tests that supplement and contextualize the reported accuracy. This additional evidence serves as augmentations and can help us perform machine learning better while avoiding naive reliance on oversimplified metrics.
Published: 2020
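
The point that class-prediction accuracy is affected by class proportions in the validation set can be checked with simple arithmetic. The sketch below assumes a classifier with fixed sensitivity and specificity; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def expected_accuracy(sensitivity, specificity, positive_fraction):
    """Expected accuracy of a classifier on a validation set with the given class mix.

    accuracy = sensitivity * P(positive) + specificity * P(negative)
    """
    return sensitivity * positive_fraction + specificity * (1.0 - positive_fraction)

# Same classifier (sensitivity 0.60, specificity 0.95), two validation sets.
balanced = expected_accuracy(0.60, 0.95, positive_fraction=0.5)
skewed = expected_accuracy(0.60, 0.95, positive_fraction=0.1)
print(round(balanced, 3), round(skewed, 3))
```

The classifier itself is unchanged between the two calls, yet its accuracy rises when positives become rarer, which is why accuracy alone says little about how well the positive class is actually recognized.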

12. The Birth of Bio-data Science: Trends, Expectations, and Applications

Author: Goh WWB and Wong L
Subjects: Bioinformatics, 01 Mathematical Sciences, 06 Biological Sciences, 08 Information and Computing Sciences
Published: 2020

13. How to do quantile normalization correctly for gene expression data analyses.

Author: Zhao, Y, Wong, L, and Goh, WWB
Abstract: Quantile normalization is an important normalization technique commonly used in high-dimensional data analysis. However, it is susceptible to class-effect proportion effects (the proportion of class-correlated variables in a dataset) and batch effects (the presence of potentially confounding technical variation) when applied blindly on whole data sets, resulting in higher false-positive and false-negative rates. We evaluate five strategies for performing quantile normalization, and demonstrate that good performance in terms of batch-effect correction and statistical feature selection can be readily achieved by first splitting data by sample class-labels before performing quantile normalization independently on each split ("Class-specific"). Via simulations with both real and simulated batch effects, we demonstrate that the "Class-specific" strategy (and others relying on similar principles) readily outperforms whole-data quantile normalization, and is robust, preserving useful signals even during the combined analysis of separately-normalized datasets. Quantile normalization is a commonly used procedure. But when carelessly applied on whole datasets without first considering class-effect proportion and batch effects, it can result in poor performance. If quantile normalization must be used, then we recommend using the "Class-specific" strategy.
Published: 2020
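
The "Class-specific" strategy described above (split samples by class label, then quantile-normalize each split independently) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, samples are plain lists of feature values, and ties are broken by feature order instead of being averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize samples (each a list of feature values of equal length).

    Every sample ends up with the same value distribution: the mean of the
    k-th smallest values across samples is assigned to each sample's k-th
    smallest feature. Ties are broken by feature order (a simplification).
    """
    n_features = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_features)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_features), key=lambda i: s[i])  # smallest to largest
        out = [0.0] * n_features
        for rank, feature_index in enumerate(order):
            out[feature_index] = reference[rank]
        normalized.append(out)
    return normalized

def class_specific_quantile_normalize(samples, labels):
    """Apply quantile normalization separately within each class label."""
    result = [None] * len(samples)
    for label in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == label]
        normalized = quantile_normalize([samples[i] for i in members])
        for i, row in zip(members, normalized):
            result[i] = row
    return result

# Toy data: two samples per class, three features each.
samples = [[5, 2, 3], [4, 1, 6], [30, 10, 20], [10, 40, 20]]
labels = ["A", "A", "B", "B"]
print(class_specific_quantile_normalize(samples, labels))
```

Because each class keeps its own reference distribution, a genuine shift between classes is not averaged away, which is the property the abstract credits for preserving useful signal.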

14. Proteomic investigation of intra-tumor heterogeneity using network-based contextualization - A case study on prostate cancer

Author: Goh WWB, Zhao Y, Sue AC-H, Guo T, and Wong L
Subjects: Biochemistry & Molecular Biology, 0601 Biochemistry and Cell Biology, 0607 Plant Biology, Analytical Chemistry
Abstract: Cancer is a heterogeneous disease, confounding the identification of relevant markers and drug targets. Network-based analysis is robust against noise, potentially offering a promising approach towards biomarker identification. We describe here the application of two network-based methods, qPSP (Quantitative Proteomics Signature Profiling) and PFSNet (Paired Fuzzy SubNetworks), in an intra-tissue proteome data set of prostate tissue samples. Despite high basal variation, we find that traditional statistical analysis may exaggerate the extent of heterogeneity. We also report that network-based analysis outperforms protein-based feature selection with concomitantly higher cross-validation accuracy. Overall, network-based analysis provides emergent signal that boosts sensitivity while retaining good precision. It is a potential means of circumventing heterogeneity for stable biomarker discovery.
Published: 2019

15. Turning straw into gold: building robustness into gene signature inference.

Author: Goh WWB and Wong L
Abstract: Reproducible and generalizable gene signatures are essential for clinical deployment, but are hard to come by. The primary issue is insufficient mitigation of confounders: ensuring that hypotheses are appropriate, test statistics and null distributions are appropriate, and so on. To further improve robustness, additional good analytical practices (GAPs) are needed, namely: leveraging existing data and knowledge; careful and systematic evaluation of gene sets, even if they overlap with known sources of confounding; and rigorous testing of inferred signatures against as many published data sets as possible. Here, using a re-examination of a breast cancer data set and 48 published signatures, we illustrate the value of adopting these GAPs.
Published: 2019

16. Turning straw into gold: building robustness into gene signature inference

Author: Goh WWB and Wong L
Subjects: Gene Expression Regulation, Neoplastic, ComputingMethodologies_PATTERNRECOGNITION, Meta-Analysis as Topic, Medicinal & Biomolecular Chemistry, Gene Expression Profiling, 0601 Biochemistry and Cell Biology, 1115 Pharmacology and Pharmaceutical Sciences, Humans, Breast Neoplasms, Female, Confounding Factors, Epidemiologic, Confounding Factors (Epidemiology)
Abstract: Reproducible and generalizable gene signatures are essential for clinical deployment, but are hard to come by. The primary issue is insufficient mitigation of confounders: ensuring that hypotheses are appropriate, test statistics and null distributions are appropriate, and so on. To further improve robustness, additional good analytical practices (GAPs) are needed, namely: leveraging existing data and knowledge; careful and systematic evaluation of gene sets, even if they overlap with known sources of confounding; and rigorous testing of inferred signatures against as many published data sets as possible. Here, using a re-examination of a breast cancer data set and 48 published signatures, we illustrate the value of adopting these GAPs.
Published: 2018

17. Fuzzy-FishNet: A highly precise distribution-free network approach for feature selection in clinical proteomics

Author: Goh WWB
Subjects: Distribution free, Computer science, Feature selection, Data mining, Proteomics, Fuzzy logic, Network approach
Abstract: Network-based analysis methods can help resolve coverage and inconsistency issues in proteomics data. Previously, it was demonstrated that a suite of rank-based network approaches (RBNAs) provides unparalleled consistency and reliable feature selection. However, reliance on the t-statistic/t-distribution and hypersensitivity (coupled to a relatively flat p-value distribution) makes feature prioritization for validation difficult. To address these concerns, a refinement based on the fuzzified Fisher exact test, Fuzzy-FishNet, was developed. Fuzzy-FishNet is highly precise (providing probability values that allow exact ranking of features). Furthermore, feature ranks are stable, even in small sample size scenarios. Comparison of features selected by genomics and proteomics data respectively revealed that in spite of relative feature stability, cross-platform overlaps are extremely limited, suggesting that networks may not be the answer towards bridging the proteomics-genomics divide.
Published: 2015

18. Overcoming analytical reliability issues in clinical proteomics using rank-based network approaches

Author: Lik-Wei Wong and Goh WWB
Subjects: Clinical biomarker, Computer science, Rank (computer programming), Proteome, Stability (learning theory), Noise (video), Data mining, Proteomics, Biological network, Reliability (statistics)
Abstract: Proteomics is poised to play critical roles in clinical research. However, due to limited coverage and high noise, integration with powerful analysis algorithms is necessary. In particular, network-based algorithms can improve selection of reproducible features in spite of incomplete proteome coverage, technical inconsistency or high inter-sample variability. We define analytical reliability on three benchmarks: precision/recall rates, feature-selection stability and cross-validation accuracy. Using these, we demonstrate the insufficiencies of the commonly used Student's t-test and Hypergeometric enrichment. Given advances in sample sizes, quantitation accuracy and coverage, we are now able to introduce and evaluate Rank-Based Network Approaches (RBNAs) for the first time in proteomics. These include SNET (SubNETwork), FSNET (FuzzySNET), PFSNET (PairedFSNET). We also introduce for the first time, PPFSNET (samplePairedPFSNET), which is a paired-sample variant of PFSNET. RBNAs (particularly PFSNET and PPFSNET) excelled on all three benchmarks and can make consistent and reproducible predictions even in the small-sample size scenario (n=4). Given these qualities, RBNAs represent an important advancement in network biology, and are expected to see practical usage, particularly in clinical biomarker and drug target prediction.
Published: 2015

19. A generalisability theory approach to quantifying changes in psychopathology among ultra-high-risk individuals for psychosis.

Author: Doborjeh Z, N Medvedev O, Doborjeh M, Singh B, Sumich A, Budhraja S, Goh WWB, Lee J, Williams M, M-K Lai E, and Kasabov N
Abstract: Distinguishing stable and fluctuating psychopathological features in young individuals at Ultra High Risk (UHR) for psychosis is challenging, but critical for building robust, accurate, early clinical detection and prevention capabilities. Over a 24-month period, 159 UHR individuals were assessed using the Positive and Negative Symptom Scale (PANSS). Generalisability Theory was used to validate the PANSS with this population and to investigate stable and fluctuating features, by estimating the reliability and generalisability of three factor (Positive, Negative, and General) and five factor (Positive, Negative, Cognitive, Depression, and Hostility) symptom models. Acceptable reliability and generalisability of scores across occasions and sample population were demonstrated by the total PANSS scale (Gr = 0.85). Fluctuating symptoms (delusions, hallucinatory behaviour, lack of spontaneity, flow in conversation, emotional withdrawal, and somatic concern) showed high variability over time, with 50-68% of the variance explained by individual transient states. In contrast, more stable symptoms included excitement, poor rapport, anxiety, guilt feeling, uncooperativeness, and poor impulse control. The 3-factor model of PANSS and its subscales showed robust reliability and generalisability of their assessment scores across the UHR population and evaluation periods (G = 0.77-0.93), offering a suitable means to assess psychosis risk. Certain subscales within the 5-factor PANSS model showed comparatively lower reliability and generalisability (G = 0.33-0.66). The identified and investigated fluctuating symptoms in UHR individuals are more amenable to intervention, which could have significant implications for preventing and addressing psychosis. Prioritising the treatment of fluctuating symptoms could enhance intervention efficacy, offering a sharper focus in clinical trials. At the same time, using the more reliable total scale and 3 subscales can contribute to more accurate assessment of enduring psychosis patterns in clinical and experimental settings. © 2024. The Author(s).
Published: 2024

20. Thinking points for effective batch correction on biomedical data.

Author: Hui HWH, Kong W, and Goh WWB
Subjects: Humans, Artificial Intelligence, Biomedical Research, Computational Biology methods, Reproducibility of Results, Algorithms
Abstract: Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge requiring comprehensive considerations. This paper underscores the necessity of a flexible and holistic approach for selecting batch effect correction algorithms (BECAs), advocating for proper BECA evaluations and consideration of artificial intelligence-based strategies. We also discuss key challenges in batch effect correction, including the importance of uncovering hidden batch factors and understanding the impact of design imbalance, missing values, and aggressive correction. Our aim is to provide researchers with a robust framework for effective batch effects management and enhancing the reliability of high-dimensional data analyses. © The Author(s) 2024. Published by Oxford University Press.
Published: 2024

21. Ten quick tips for ensuring machine learning model validity.

Author: Goh WWB, Kabir MN, Yoo S, and Wong L
Subjects: Humans, Artificial Intelligence, Reproducibility of Results, Algorithms, Machine Learning, Computational Biology methods
Abstract: Artificial Intelligence (AI) and Machine Learning (ML) models are increasingly deployed on biomedical and health data to shed insights on biological mechanism, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices on how to check AI/ML models from two perspectives: the user and the developer. Competing Interests: The authors have declared that no competing interests exist. Copyright: © 2024 Goh et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Published: 2024

22. OLB-AC: toward optimizing ligand bioactivities through deep graph learning and activity cliffs.

Author: Yin Y, Hu H, Yang J, Ye C, Goh WWB, Kong AW, and Wu J
Subjects: Ligands, Neural Networks, Computer, Drug Discovery methods, Deep Learning
Abstract: Motivation: Deep graph learning (DGL) has been widely employed in the realm of ligand-based virtual screening. Within this field, a key hurdle is the existence of activity cliffs (ACs), where minor chemical alterations can lead to significant changes in bioactivity. In response, several DGL models have been developed to enhance ligand bioactivity prediction in the presence of ACs. Yet, there remains a largely unexplored opportunity within ACs for optimizing ligand bioactivity, making it an area ripe for further investigation. Results: We present a novel approach to simultaneously predict and optimize ligand bioactivities through DGL and ACs (OLB-AC). OLB-AC possesses the capability to optimize ligand molecules located near ACs, providing a direct reference for optimizing ligand bioactivities with the matching of original ligands. To accomplish this, a novel attentive graph reconstruction neural network and ligand optimization scheme are proposed. Attentive graph reconstruction neural network reconstructs original ligands and optimizes them through adversarial representations derived from their bioactivity prediction process. Experimental results on nine drug targets reveal that out of the 667 molecules generated through OLB-AC optimization on datasets comprising 974 low-activity, noninhibitor, or highly toxic ligands, 49 are recognized as known highly active, inhibitor, or nontoxic ligands beyond the datasets' scope. The 27 out of 49 matched molecular pairs generated by OLB-AC reveal novel transformations not present in their training sets. The adversarial representations employed for ligand optimization originate from the gradients of bioactivity predictions. Therefore, we also assess OLB-AC's prediction accuracy across 33 different bioactivity datasets. Results show that OLB-AC achieves the best Pearson correlation coefficient (r2) on 27/33 datasets, with an average improvement of 7.2%-22.9% against the state-of-the-art bioactivity prediction methods. Availability and Implementation: The code and dataset developed in this work are available at github.com/Yueming-Yin/OLB-AC. © The Author(s) 2024. Published by Oxford University Press.
Published: 2024

23. Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.

Author: Peng H, Wang H, Kong W, Li J, and Goh WWB
Subjects: Workflow, Machine Learning, Proteome metabolism, Humans, Algorithms, Databases, Protein, Proteomics methods
Abstract: Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows for expanding differential proteome coverage and resolving inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows. © 2024. The Author(s).
Published: 2024
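
The abstract above reports F1 scores, Matthews correlation coefficients (MCC) and G-mean. As a rough illustration only, the sketch below computes these three metrics from confusion-matrix counts. The function name `confusion_metrics` is hypothetical, and G-mean is assumed here to be the geometric mean of sensitivity and specificity, which is a common convention; the paper's own evaluation code and definitions are not reproduced.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return F1, MCC and G-mean from confusion-matrix counts.

    Hypothetical helper for illustration; not taken from the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    # MCC uses all four cells of the confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}


if __name__ == "__main__":
    print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
```

Unlike F1, both MCC and G-mean take true negatives into account, which may be why they are reported alongside F1 for spike-in benchmarks where most proteins are unchanged.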

24. Clinical use cases in artificial intelligence: current trends and future opportunities.

Author: Tan CH, Goh WWB, So JBY, and Sung JJY
Subjects: Humans, Forecasting, Artificial Intelligence
Published: 2024

25. Optimizing the PROTREC network-based missing protein prediction algorithm.

Author: Wu W, Huang Z, Kong W, Peng H, and Goh WWB
Subjects: Humans, Algorithms, Neoplasms
Abstract: This article summarizes the PROTREC method and investigates the impact that the different hyper-parameters have on the task of missing protein prediction using PROTREC. We evaluate missing protein recovery rates using different PROTREC score selection approaches (MAX, MIN, MEDIAN, and MEAN), different PROTREC score thresholds, as well as different complex size thresholds. In addition, we included two additional cancer datasets in our analysis and introduced a new validation method to check both the robustness of the PROTREC method as well as the correctness of our analysis. Our analysis showed that the missing protein recovery rate can be improved by adopting PROTREC score selection operations of MIN, MEDIAN, and MEAN instead of the default MAX. However, this may come at a cost of reduced numbers of proteins predicted and validated. The users should therefore choose their hyper-parameters carefully to find a balance in the accuracy-quantity trade-off. We also explored the possibility of combining PROTREC with a p-value-based method (FCS) and demonstrated that PROTREC is able to perform well independently without any help from a p-value-based method. Furthermore, we conducted a downstream enrichment analysis to understand the biological pathways and protein networks within the cancerous tissues using the recovered proteins. Missing protein recovery rate using PROTREC can be improved by selecting a different PROTREC score selection method. Different PROTREC score selection methods and other hyper-parameters such as PROTREC score threshold and complex size threshold introduce accuracy-quantity trade-off. PROTREC is able to perform well independently of any filtering using a p-value-based method. Verification of the PROTREC method on additional cancer datasets. Downstream Enrichment Analysis to understand the biological pathways and protein networks in cancerous tissues., (© 2023 Wiley-VCH GmbH.)
Published: 2024
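
The score selection operations named in the abstract above (MAX, MIN, MEDIAN, MEAN), together with the score and complex size thresholds, can be sketched as follows. This is a minimal illustration under stated assumptions: it assumes each protein inherits one score from every complex it belongs to, and the function name `select_proteins`, the default thresholds and the input layout are all hypothetical. It does not reproduce how PROTREC computes the scores themselves.

```python
from statistics import mean, median

# The four selection operations named in the abstract.
SELECTORS = {"MAX": max, "MIN": min, "MEDIAN": median, "MEAN": mean}


def select_proteins(complex_scores, complex_members, mode="MAX",
                    score_threshold=0.95, min_complex_size=5):
    """Aggregate per-complex scores into one score per protein.

    complex_scores:  {complex_id: score}            (hypothetical layout)
    complex_members: {complex_id: [protein, ...]}   (hypothetical layout)
    Returns {protein: aggregated score} for proteins passing the threshold.
    """
    per_protein = {}
    for complex_id, score in complex_scores.items():
        members = complex_members[complex_id]
        if len(members) < min_complex_size:
            continue  # complex size threshold
        for protein in members:
            per_protein.setdefault(protein, []).append(score)

    aggregate = SELECTORS[mode]
    selected = {}
    for protein, scores in per_protein.items():
        value = aggregate(scores)
        if value >= score_threshold:  # score threshold
            selected[protein] = value
    return selected
```

With this setup, MIN is the strictest operation because a protein must score well in every complex it belongs to, so it tends to return fewer proteins than MAX. That is one way to picture the accuracy-quantity trade-off the abstract describes.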

26. MultiPro: DDA-PASEF and diaPASEF acquired cell line proteomic datasets with deliberate batch effects.

Author: Wang H, Lim KP, Kong W, Gao H, Wong BJH, Phua SX, Guo T, and Goh WWB
Subjects: Algorithms, Mass Spectrometry, Humans, Cell Line, Proteome metabolism, Proteomics
Abstract: Mass spectrometry-based proteomics plays a critical role in current biological and clinical research. Technical issues like data integration, missing value imputation, batch effect correction and the exploration of inter-connections amongst these technical issues, can produce errors but are not well studied. Although proteomic technologies have improved significantly in recent years, this alone cannot resolve these issues. What is needed are better algorithms and data processing knowledge. But to obtain these, we need appropriate proteomics datasets for exploration, investigation, and benchmarking. To meet this need, we developed MultiPro (Multi-purpose Proteome Resource), a resource comprising four comprehensive large-scale proteomics datasets with deliberate batch effects using the latest parallel accumulation-serial fragmentation in both Data-Dependent Acquisition (DDA) and Data Independent Acquisition (DIA) modes. Each dataset contains a balanced two-class design based on well-characterized and widely studied cell lines (A549 vs K562 or HCC1806 vs HS578T) with 48 or 36 biological and technical replicates altogether, allowing for investigation of a multitude of technical issues. These datasets allow for investigation of inter-connections between class and batch factors, or to develop approaches to compare and integrate data from DDA and DIA platforms., (© 2023. The Author(s).)
Published: 2023

27. Modeling the influence of attitudes, trust, and beliefs on endoscopists' acceptance of artificial intelligence applications in medical practice.

Author: Schulz PJ, Lwin MO, Kee KM, Goh WWB, Lam TYT, and Sung JJY
Subjects: Adult, Female, Humans, Male, Educational Status, Policy, Technology, Gastroenterology, Endoscopy, Artificial Intelligence, Trust
Abstract: Introduction: The potential for deployment of Artificial Intelligence (AI) technologies in various fields of medicine is vast, yet acceptance of AI amongst clinicians has been patchy. This research therefore examines the role of antecedents, namely trust, attitude, and beliefs in driving AI acceptance in clinical practice., Methods: We utilized online surveys to gather data from clinicians in the field of gastroenterology., Results: A total of 164 participants responded to the survey. Participants had a mean age of 44.49 (SD = 9.65). Most participants were male (n = 116, 70.30%) and specialized in gastroenterology (n = 153, 92.73%). Based on the results collected, we proposed and tested a model of AI acceptance in medical practice. Our findings showed that while the proposed drivers had a positive impact on AI tools' acceptance, not all effects were direct. Trust and belief were found to fully mediate the effects of attitude on AI acceptance by clinicians., Discussion: The role of trust and beliefs as primary mediators of the acceptance of AI in medical practice suggests that these should be areas of focus in AI education, engagement and training. This has implications for how AI systems can gain greater clinician acceptance to engender greater trust and adoption amongst public health systems and professional networks which in turn would impact how populations interface with AI. Implications for policy and practice, as well as future research in this nascent field, are discussed., Competing Interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest., (Copyright © 2023 Schulz, Lwin, Kee, Goh, Lam and Sung.)
Published: 2023

28. RNA-sequencing of peripheral whole blood of individuals at ultra-high-risk for psychosis - A longitudinal perspective.

Author: Tan SMX, Yee JY, Budhraja S, Singh B, Doborjeh Z, Doborjeh M, Kasabov N, Lai E, Sumich A, Lee J, and Goh WWB
Subjects: Adolescent, Humans, Longitudinal Studies, Biomarkers, RNA, Psychotic Disorders genetics, Schizophrenia genetics
Abstract: Background: The peripheral blood is an attractive source of prognostic biomarkers for psychosis conversion. There is limited research on the transcriptomic changes associated with psychosis conversion in the peripheral whole blood., Study Design: We performed RNA-sequencing of peripheral whole blood from 65 ultra-high-risk (UHR) participants and 70 healthy control participants recruited in the Longitudinal Youth-at-Risk Study (LYRIKS) cohort. 13 UHR participants converted in the study duration. Samples were collected at 3 timepoints, at 12-month intervals across a 2-year period. We examined whether the genes differential with psychosis conversion contain schizophrenia risk loci. We then examined the functional ontologies and GWAS associations of the differential genes. We also identified the overlap between differentially expressed genes across different comparisons., Study Results: Genes containing schizophrenia risk loci were not differentially expressed in the peripheral whole blood in psychosis conversion. The differentially expressed genes in psychosis conversion are enriched for ontologies associated with cellular replication. The differentially expressed genes in psychosis conversion are associated with non-neurological GWAS phenotypes reported to be perturbed in schizophrenia and psychosis but not schizophrenia and psychosis phenotypes themselves. We found minimal overlap between the genes differential with psychosis conversion and the genes that are differential between pre-conversion and non-conversion samples., Conclusion: The associations between psychosis conversion and peripheral blood-based biomarkers are likely to be indirect. Further studies to elucidate the mechanism behind potential indirect associations are needed., Competing Interests: The authors declare no conflict of interest., (Copyright © 2023. Published by Elsevier B.V.)
Published: 2023

29. Data pre-processing for analyzing microbiome data - A mini review.

Author: Zhou R, Ng SK, Sung JJY, Goh WWB, and Wong SH
Abstract: The human microbiome is an emerging research frontier due to its profound impacts on health. High-throughput microbiome sequencing enables studying microbial communities but suffers from analytical challenges. In particular, the lack of dedicated preprocessing methods to improve data quality impedes effective minimization of biases prior to downstream analysis. This review aims to address this gap by providing a comprehensive overview of preprocessing techniques relevant to microbiome research. We outline a typical workflow for microbiome data analysis. Preprocessing methods discussed include quality filtering, batch effect correction, imputation of missing values, normalization, and data transformation. We highlight strengths and limitations of each technique to serve as a practical guide for researchers and identify areas needing further methodological development. Establishing robust, standardized preprocessing will be essential for drawing valid biological conclusions from microbiome studies., Competing Interests: The authors of this manuscript declare no conflict of interests., (© 2023 The Authors.)
Published: 2023

30. How missing value imputation is confounded with batch effects and what you can do about it.

Author: Goh WWB, Hui HWH, and Wong L
Subjects: Electronic Data Processing
Abstract: In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent. Batch sensitization can improve the quality of MVI. Conversely, accounting for missingness also improves proper BE estimation in BEC. Here, we discuss how BEC and MVI are interconnected and interdependent. We show how batch sensitization can improve any MVI and bring attention to the idea of BE-associated missing values (BEAMs). Finally, we discuss how batch-class imbalance problems can be mitigated by borrowing ideas from machine learning., (Copyright © 2023 Elsevier Ltd. All rights reserved.)
Published: 2023

31. ProJect: a powerful mixed-model missing value imputation method.

Author: Kong W, Wong BJH, Hui HWH, Lim KP, Wang Y, Wong L, and Goh WWB
Subjects: Bayes Theorem, Oligonucleotide Array Sequence Analysis methods, Mass Spectrometry methods, Algorithms, Genomics
Abstract: Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, scoring 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the Procrustes sum of squared error (Procrustes SS) (exhibits 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also leads with the highest correlation coefficient among all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle different types of MVs commonly found in real-world data. Unlike most MVI methods that are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines if an MV is missing at random or missing not at random. It then employs targeted imputation strategies for each MV type, resulting in more accurate and reliable imputation outcomes. An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect., (© The Author(s) 2023. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
Published: 2023
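
The abstract above ranks imputation methods by normalized root mean square error (NRMSE). The sketch below shows one common way to compute it, assuming the error is normalized by the variance of the true values; other normalizations exist, and the abstract does not say which one the paper uses. The function name `nrmse` is hypothetical, and ProJect itself is not reproduced here.

```python
import math
from statistics import pvariance


def nrmse(true_values, imputed_values):
    """Normalized RMSE between held-out true values and their imputations.

    Assumed convention: sqrt(mean squared error / variance of true values).
    """
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    variance = pvariance(true_values)
    if variance == 0:
        raise ValueError("true values have zero variance")
    mse = sum((t - i) ** 2
              for t, i in zip(true_values, imputed_values)) / len(true_values)
    return math.sqrt(mse / variance)
```

Under this convention a perfect imputation scores 0, and replacing every value with the overall mean scores 1, which gives a useful baseline when reading relative error reductions such as those quoted above.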

32. A novel survival prediction signature outperforms PAM50 and artificial intelligence-based feature-selection methods.

Author: Foo RJK, Tian S, Tan EY, and Goh WWB
Subjects: Humans, Female, Gene Expression Profiling methods, Algorithms, Artificial Intelligence, Breast Neoplasms genetics
Abstract: The robustness of a breast cancer gene signature, the super-proliferation set (SPS), is initially tested and investigated on breast cancer cell lines from the Cancer Cell Line Encyclopaedia (CCLE). Previously, SPS was derived via a meta-analysis of 47 independent breast cancer gene signatures, benchmarked on survival information from clinical data in the NKI dataset. Here, relying on the stability of cell line data and associative prior knowledge, we first demonstrate through Principal Component Analysis (PCA) that SPS prioritizes survival information over secondary subtype information, surpassing both PAM50 and Boruta, an artificial intelligence-based feature-selection algorithm, in this regard. We can also extract higher resolution 'progression' information using SPS, dividing survival outcomes into several clinically relevant stages ('good', 'intermediate', and 'bad') based on different quadrants of the PCA scatterplot. Furthermore, by transferring these 'progression' annotations onto independent clinical datasets, we demonstrate the generalisability of our method on actual patient data. Finally, via the characteristic genetic profiles of each quadrant/stage, we identified efficacious drugs using their gene reversal scores that can shift signatures across quadrants/stages, in a process known as gene signature reversal. This confirms the power of meta-analytical approaches for gene signature inference in breast cancer, as well as the clinical benefit in translating these inferences onto real-world patient data for more targeted therapies., Competing Interests: The authors declare no conflicting interests, financial or otherwise., (Copyright © 2023 Elsevier Ltd. All rights reserved.)
Published: 2023

33. PROSE: phenotype-specific network signatures from individual proteomic samples.

Author: Wong BJH, Kong W, Peng H, and Goh WWB
Subjects: Proteomics methods, Proteome analysis
Abstract: Proteomic studies characterize the protein composition of complex biological samples. Despite recent advancements in mass spectrometry instrumentation and computational tools, low proteome coverage and interpretability remain a challenge. To address this, we developed Proteome Support Vector Enrichment (PROSE), a fast, scalable and lightweight pipeline for scoring proteins based on orthogonal gene co-expression network matrices. PROSE utilizes simple protein lists as input, generating a standard enrichment score for all proteins, including undetected ones. In our benchmark with 7 other candidate prioritization techniques, PROSE shows high accuracy in missing protein prediction, with scores correlating strongly to corresponding gene expression data. As a further proof-of-concept, we applied PROSE to a reanalysis of the Cancer Cell Line Encyclopedia proteomics dataset, where it captures key phenotypic features, including gene dependency. We lastly demonstrated its applicability on a breast cancer clinical dataset, showing clustering by annotated molecular subtype and identification of putative drivers of triple-negative breast cancer. PROSE is available as a user-friendly Python module from https://github.com/bwbio/PROSE., (© The Author(s) 2023. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
Published: 2023

34. ProInfer: An interpretable protein inference tool leveraging on biological networks.

Author: Peng H, Wong L, and Goh WWB
Subjects: Proteome analysis, Mass Spectrometry, Proteomics methods, Databases, Protein, Software, Algorithms, Peptides chemistry
Abstract: In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (a peptide which maps to more than one candidate protein) and enhances the statistical confidence for proteins supported by both unique and ambiguous peptides. Consequently, ProInfer rescues weakly supported proteins thereby improving proteome coverage. Evaluated across THP1 cell line, lung cancer and RAW267.4 datasets, ProInfer always infers the largest number of true positives, in comparison to mainstream protein inference tools Fido, EPIFANY and PIA. ProInfer is also adept at retrieving differentially expressed proteins, signifying its usefulness for functional analysis and phenotype profiling. Source code for ProInfer is available at https://github.com/PennHui2016/ProInfer., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2023 Peng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
Published: 2023

35. The importance of batch sensitization in missing value imputation.

Author: Hui HWH, Kong W, Peng H, and Goh WWB
Subjects: Proteomics, Data Interpretation, Statistical, Algorithms, Genomics
Abstract: Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising as missing values are imputed during early pre-processing while batch effects are mitigated during late pre-processing, prior to functional analysis. Unless actively managed, MVI approaches generally ignore the batch covariate, with unknown consequences. We examine this problem by modelling three simple imputation strategies: global (M1), self-batch (M2) and cross-batch (M3) first via simulations, and then corroborated on real proteomics and genomics data. We report that explicit consideration of batch covariates (M2) is important for good outcomes, resulting in enhanced batch correction and lower statistical errors. However, M1 and M3 are error-generating: global and cross-batch averaging may result in batch-effect dilution, with concomitant and irreversible increase in intra-sample noise. This noise is unremovable via batch correction algorithms and produces false positives and negatives. Hence, careless imputation in the presence of non-negligible covariates such as batch effects should be avoided., (© 2023. The Author(s).)
Published: 2023
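
The three imputation strategies compared in the abstract above, global (M1), self-batch (M2) and cross-batch (M3), can be pictured with a small sketch. It assumes simple mean imputation on a single feature, consistent with the abstract's mention of "averaging", but the function name `impute`, the strategy labels and the data layout are hypothetical and the paper's simulation code is not reproduced.

```python
import math
from statistics import mean


def impute(values, batches, strategy):
    """Mean-impute NaNs in one feature using a batch-aware donor pool.

    values:   list of floats, with math.nan marking missing entries
    batches:  list of batch labels, same length as values
    strategy: "global" (M1), "self" (M2) or "cross" (M3)
    """
    observed = [(v, b) for v, b in zip(values, batches) if not math.isnan(v)]
    result = []
    for value, batch in zip(values, batches):
        if not math.isnan(value):
            result.append(value)
            continue
        if strategy == "global":      # M1: donors from every batch
            donors = [v for v, _ in observed]
        elif strategy == "self":      # M2: donors from the same batch only
            donors = [v for v, b in observed if b == batch]
        elif strategy == "cross":     # M3: donors from the other batches only
            donors = [v for v, b in observed if b != batch]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # Leave the value missing if no donor is available.
        result.append(mean(donors) if donors else math.nan)
    return result


if __name__ == "__main__":
    vals = [1.0, 3.0, math.nan, 11.0, 13.0, math.nan]
    batch = ["a", "a", "a", "b", "b", "b"]
    for s in ("global", "self", "cross"):
        print(s, impute(vals, batch, s))
```

In the toy data the two batches differ by a shift of 10. Only the self-batch strategy fills each gap with a value on the scale of its own batch; the global and cross-batch strategies pull imputed values toward the other batch, which is one way to read the "batch-effect dilution" the abstract warns about.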

36. Author Correction: Activation function 1 of progesterone receptor is required for progesterone antagonism of oestrogen action in the uterus.

Author: Lee SH, Lim CL, Shen W, Tan SMX, Woo ARE, Yap YHY, Sian CAS, Goh WWB, Yu WP, Li L, and Lin VCL
Published: 2023

37. Evaluating network-based missing protein prediction using p-values, Bayes Factors, and probabilities.

Author: Goh WWB, Kong W, and Wong L
Subjects: Bayes Theorem, Probability, Proteomics, Proteins
Abstract: Some prediction methods use probability to rank their predictions, while some other prediction methods do not rank their predictions and instead use p-values to support their predictions. This disparity renders direct cross-comparison of these two kinds of methods difficult. In particular, approaches such as the Bayes Factor upper Bound (BFB) for p-value conversion may not make correct assumptions for this kind of cross-comparison. Here, using a well-established case study on renal cancer proteomics and in the context of missing protein prediction, we demonstrate how to compare these two kinds of prediction methods using two different strategies. The first strategy is based on false discovery rate (FDR) estimation, which does not make the same naïve assumptions as BFB conversions. The second strategy is a powerful approach which we colloquially call "home ground testing". Both strategies perform better than BFB conversions. Thus, we recommend comparing prediction methods by standardization to a common performance benchmark such as a global FDR. And where this is not possible, we recommend reciprocal "home ground testing".
Published: 2023
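
For readers unfamiliar with the Bayes Factor upper Bound mentioned in the abstract above, the sketch below implements the widely used form BFB = 1 / (-e p ln p) for p < 1/e, and the corresponding upper bound on the posterior probability of the alternative hypothesis. The function names and the default prior odds of 1 are assumptions for illustration; the exact conversion applied in the paper is not stated in the abstract.

```python
import math


def bayes_factor_bound(p):
    """Upper bound on the Bayes factor in favour of the alternative.

    Uses BFB = 1 / (-e * p * ln p) for p < 1/e; the bound is 1 otherwise.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if p >= 1.0 / math.e:
        return 1.0
    return 1.0 / (-math.e * p * math.log(p))


def posterior_upper_bound(p, prior_odds=1.0):
    """Upper bound on the posterior probability of the alternative.

    prior_odds is an assumption (1.0 means both hypotheses equally likely).
    """
    posterior_odds = bayes_factor_bound(p) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(p, bayes_factor_bound(p), posterior_upper_bound(p))
```

Because the result is only an upper bound that depends on the assumed prior odds, treating it as if it were a calibrated probability is the kind of assumption the abstract cautions against when comparing it directly with probability-ranked predictions.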

38. Emotional Variance Analysis: A new sentiment analysis feature set for Artificial Intelligence and Machine Learning applications.

Author: Tan L, Tan OK, Sze CC, and Goh WWB
Subjects: Humans, Machine Learning, Neural Networks, Computer, Attitude, Artificial Intelligence, Sentiment Analysis
Abstract: Sentiment Analysis (SA) is a category of data mining techniques that extract latent representations of affective states within textual corpuses. This has wide ranging applications from online reviews to capturing mental states. In this paper, we present a novel SA feature set; Emotional Variance Analysis (EVA), which captures patterns of emotional instability. Applying EVA on student journals garnered from an Experiential Learning (EL) course, we find that EVA is useful for profiling variations in sentiment polarity and intensity, which in turn can predict academic performance. As a feature set, EVA is compatible with a wide variety of Artificial Intelligence (AI) and Machine Learning (ML) applications. Although evaluated on education data, we foresee EVA to be useful in mental health profiling and consumer behaviour applications. EVA is available at https://qr.page/g/5jQ8DQmWQT4. Our results show that EVA was able to achieve an overall accuracy of 88.7% and outperform NLP (76.0%) and SentimentR (58.0%) features by 15.8% and 51.7% respectively when predicting student experiential learning grade scores through a Multi-Layer Perceptron (MLP) ML model., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2023 Tan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
Published: 2023

39. Dealing with missing values in proteomics data.

Author: Kong W, Hui HWH, Peng H, and Goh WWB
Subjects: Proteomics, Algorithms
Abstract: Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data is normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide decision making on suitable MVI method, we provide a decision chart which facilitates strategic considerations on datasets presenting different characteristics. We also bring attention to other issues that can impact proper MVI such as the presence of confounders (e.g., batch effects) which can influence MVI performance. Thus, these too, should be considered during or before MVI., (© 2022 Wiley-VCH GmbH.)
Published: 2022

40. A novel pipeline for computerized mouse spermatogenesis staging.

Author: Lu H, Zang M, Marini GPL, Wang X, Jiao Y, Ao N, Ong K, Huo X, Li L, Xu EY, Goh WWB, Yu W, and Xu J
Subjects: Mice, Male, Animals, Seminiferous Tubules, Testis anatomy & histology, Semen, Spermatogenesis
Abstract: Motivation: Differentiating 12 stages of the mouse seminiferous epithelial cycle is vital towards understanding the dynamic spermatogenesis process. However, it is challenging since two adjacent spermatogenic stages are morphologically similar. Distinguishing Stages I-III from Stages IV-V is important for histologists to understand sperm development in wildtype mice and spermatogenic defects in infertile mice. To achieve this, we propose a novel pipeline for computerized spermatogenesis staging (CSS)., Results: The CSS pipeline comprises four parts: (i) A seminiferous tubule segmentation model is developed to extract every single tubule; (ii) A multi-scale learning (MSL) model is developed to integrate local and global information of a seminiferous tubule to distinguish Stages I-V from Stages VI-XII; (iii) a multi-task learning (MTL) model is developed to segment the multiple testicular cells for Stages I-V without an exhaustive requirement for manual annotation; (iv) A set of 204D image-derived features is developed to discriminate Stages I-III from Stages IV-V by capturing cell-level and image-level representation. Experimental results suggest that the proposed MSL and MTL models outperform classic single-scale and single-task models when manual annotation is limited. In addition, the proposed image-derived features are discriminative between Stages I-III and Stages IV-V. In conclusion, the CSS pipeline can not only provide histologists with a solution to facilitate quantitative analysis for spermatogenesis stage identification but also help them to uncover novel computerized image-derived biomarkers., Availability and Implementation: https://github.com/jydada/CSS., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
Published: 2022

41. Protocol to identify functional doppelgängers and verify biomedical gene expression data using doppelgangerIdentifier.

Author: Wang LR, Fan X, and Goh WWB
Subjects: Gene Expression genetics, Machine Learning, Software
Abstract: Functional doppelgängers (FDs) are independently derived sample pairs that confound machine learning (ML) model performance when assorted across training and validation sets. Here, we detail the use of doppelgangerIdentifier (DI), providing software installation, data preparation, doppelgänger identification, and functional testing steps. We demonstrate examples with biomedical gene expression data. We also provide guidelines for the selection of user-defined function arguments. For complete details on the use and execution of this protocol, please refer to Wang et al. (2022)., Competing Interests: The authors declare no competing interests., (© 2022 The Author(s).)
Published: 2022

42. DNA methylation levels of RELN promoter region in ultra-high risk, first episode and chronic schizophrenia cohorts of schizophrenia.

Author: Kho SH, Yee JY, Puang SJ, Han L, Chiang C, Rapisarda A, Goh WWB, Lee J, and Sng JCG
Abstract: The essential role of the Reelin gene (RELN) during brain development makes it a prominent candidate in human epigenetic studies of Schizophrenia. Previous literature has reported differing levels of DNA methylation (DNAm) in patients with psychosis. Therefore, this study aimed to (1) examine and compare RELN DNAm levels in subjects at different stages of psychosis cross-sectionally, (2) analyse the effect of antipsychotics (AP) on DNAm, and (3) evaluate the effectiveness and applicability of RELN promoter DNAm as a possible biological-based marker for symptom severity in psychosis. The study cohort consisted of 56 healthy controls, 87 ultra-high risk (UHR) individuals, 26 first-episode (FE) psychosis individuals and 30 chronic schizophrenia (CS) individuals. The Positive and Negative Syndrome Scale (PANSS) was used to assess Schizophrenia severity. After pyrosequencing selected CpG sites of peripheral blood, the Average mean DNAm levels were compared amongst the 4 subgroups. Our results showed differing levels of DNAm, with UHR having the lowest (7.72 ± 0.19) while the CS had the highest levels (HC: 8.78 ± 0.35; FE: 7.75 ± 0.37; CS: 8.82 ± 0.48). Significantly higher Average mean DNAm levels were found in CS subjects on AP (9.12 ± 0.61) compared to UHR without medication (UHR(-)) (7.39 ± 0.18). A significant association was also observed between the Average mean DNAm of FE and PANSS Negative symptom factor (R² = 0.237, β = -0.401, p = 0.033). In conclusion, our findings suggested different levels of DNAm for subjects at different stages of psychosis. Those subjects that took AP have different DNAm levels. There were significant associations between FE DNAm and Negative PANSS scores. With more future experiments and on larger cohorts, there may be potential use of DNAm of the RELN gene as one of the genes for the biological-based marker for symptom severity in psychosis., (© 2022. The Author(s).)
Published: 2022

43. Activation function 1 of progesterone receptor is required for progesterone antagonism of oestrogen action in the uterus.

Author: Lee SH, Lim CL, Shen W, Tan SMX, Woo ARE, Yap YHY, Sian CAS, Goh WWB, Yu WP, Li L, and Lin VCL
Subjects: Animals, Chromatin metabolism, Endometrium metabolism, Estrogens metabolism, Estrogens pharmacology, Female, Furylfuramide metabolism, Furylfuramide pharmacology, Mice, Pregnancy, Receptors, Estrogen genetics, Receptors, Estrogen metabolism, Uterus metabolism, Progesterone metabolism, Progesterone pharmacology, Receptors, Progesterone genetics, Receptors, Progesterone metabolism
Abstract: Background: Progesterone receptor (PGR) is a master regulator of uterine function through antagonistic and synergistic interplays with oestrogen receptors. PGR action is primarily mediated by activation functions AF1 and AF2, but their physiological significance is unknown., Results: We report the first study of AF1 function in mice. The AF1 mutant mice are infertile with impaired implantation and decidualization. This is associated with a delay in the cessation of epithelial proliferation and in the initiation of stromal proliferation at preimplantation. Despite tissue selective effect on PGR target genes, AF1 mutations caused global loss of the antioestrogenic activity of progesterone in both pregnant and ovariectomized models. Importantly, the study provides evidence that PGR can exert an antioestrogenic effect by genomic inhibition of Esr1 and Greb1 expression. ChIP-Seq data mining reveals intermingled PGR and ESR1 binding on Esr1 and Greb1 gene enhancers. Chromatin conformation analysis shows reduced interactions in these genes' loci in the mutant, coinciding with their upregulations., Conclusion: AF1 mediates genomic inhibition of ESR1 action globally whilst it also has tissue-selective effect on PGR target genes., (© 2022. The Author(s).)
Published: 2022
Full Text: View/download PDF

44. Are batch effects still relevant in the age of big data?

Author: Goh WWB, Yong CH, and Wong L
Subjects: Machine Learning, Artificial Intelligence, Big Data
Abstract: Batch effects (BEs) are technical biases that may confound analysis of high-throughput biotechnological data. BEs are complex and effective mitigation is highly context-dependent. In particular, the advent of high-resolution technologies such as single-cell RNA sequencing presents new challenges. We first cover how BE modeling differs between traditional datasets and the new data landscape. We also discuss new approaches for measuring and mitigating BEs, including whether a BE is significant enough to warrant correction. Even with the advent of machine learning and artificial intelligence, the increased complexity of next-generation biotechnological data means increased complexities in BE management. We forecast that BEs will not only remain relevant in the age of big data but will become even more important., Competing Interests: Declaration of interests The authors declare no conflicting interest, financial or otherwise., (Copyright © 2022 Elsevier Ltd. All rights reserved.)
Published: 2022
Full Text: View/download PDF

45. Data considerations for predictive modeling applied to the discovery of bioactive natural products.

Author: Xue HT, Stanley-Baker M, Kong AWK, Li HL, and Goh WWB
Subjects: Drug Development, Drug Discovery methods, Artificial Intelligence, Biological Products pharmacology
Abstract: Natural products (NPs) constitute a large reserve of bioactive compounds useful for drug development. Recent advances in high-throughput technologies facilitate functional analysis of therapeutic effects and NP-based drug discovery. However, the large amount of generated data is complex and difficult to analyze effectively. This limitation is increasingly surmounted by artificial intelligence (AI) techniques but more needs to be done. Here, we present and discuss two crucial issues limiting NP-AI drug discovery: the first is on knowledge and resource development (data integration) to bridge the gap between NPs and functional or therapeutic effects. The second issue is on NP-AI modeling considerations, limitations and challenges., Competing Interests: Conflicts of interest The authors declare no competing interests, financial or otherwise., (Copyright © 2022 Elsevier Ltd. All rights reserved.)
Published: 2022
Full Text: View/download PDF

46. Doppelgänger spotting in biomedical gene expression data.

Author: Wang LR, Choy XY, and Goh WWB
Abstract: Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when split across training and validation sets, they inflate the trained machine learning (ML) model performance. This inflationary effect causes misleading confidence on the deployability of the model. Thus far, there are no tools for doppelgänger identification or standard practices to manage their confounding implications. We present doppelgangerIdentifier, a software suite for doppelgänger identification and verification. Applying doppelgangerIdentifier across a multitude of diseases and data types, we show the pervasive nature of DEs in biomedical gene expression data. We also provide guidelines toward proper doppelgänger identification by exploring the ramifications of lingering batch effects from batch imbalances on the sensitivity of our doppelgänger identification algorithm. We suggest doppelgänger verification as a useful procedure to establish baselines for model evaluation that may inform on whether feature selection and ML on the data set may yield meaningful insights., Competing Interests: The authors declare no conflicting interests, financial or otherwise., (© 2022 The Author(s).)
Published: 2022
Full Text: View/download PDF

47. Resolving missing protein problems using functional class scoring.

Author: Wong BJH, Kong W, Wong L, and Goh WWB
Subjects: Peptides, Proteins chemistry, Research Design, Biomedical Research, Proteomics methods
Abstract: Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in "data holes". These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples, hindering biomarker discovery and proper functional characterization. Network-based approaches can provide powerful solutions for resolving these issues. Functional Class Scoring (FCS) is one such method that uses protein complex information to recover missing proteins with weak support. However, FCS has not been evaluated on more recent proteomic technologies with higher coverage, and there is no clear way to evaluate its performance. To address these issues, we devised a more rigorous evaluation schema based on cross-verification between technical replicates and evaluated its performance on data acquired under recent Data-Independent Acquisition (DIA) technologies (viz. SWATH). Although cross-replicate examination reveals some inconsistencies amongst same-class samples, tissue-differentiating signal is nonetheless strongly conserved, confirming that FCS selects for biologically meaningful networks. We also report that predicted missing proteins are statistically significant based on FCS p values. Despite limited cross-replicate verification rates, the predicted missing proteins as a whole have higher peptide support than non-predicted proteins. FCS also predicts missing proteins that are often lost due to weak specific peptide support., (© 2022. The Author(s).)
Published: 2022
Full Text: View/download PDF

48. An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study.

Author: Zhang X, Lee J, and Goh WWB
Abstract: Machine learning (ML) is increasingly deployed on biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact downstream analyses. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), an in-house developed normalisation technique, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that choice of normalisation technique has strong influence on feature selection, with GFS outperforming other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype-of-interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivete. Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms. This helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps explain the model and provides directions for further improvement., Competing Interests: The authors declare no conflict of interest., (© 2022 The Author(s).)
Published: 2022
Full Text: View/download PDF

49. How doppelgänger effects in biomedical data confound machine learning.

Author: Wang LR, Wong L, and Goh WWB
Subjects: Reproducibility of Results, Machine Learning
Abstract: Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split., Competing Interests: Declaration of interests The authors declare no competing interests., (Copyright © 2021 Elsevier Ltd. All rights reserved.)
Published: 2022
Full Text: View/download PDF

50. Proteomic datasets of HeLa and SiHa cell lines acquired by DDA-PASEF and diaPASEF.

Author: Huang Z, Kong W, Wong BJ, Gao H, Guo T, Liu X, Du X, Wong L, and Goh WWB
Abstract: We present four datasets on proteomics profiling of HeLa and SiHa cell lines associated with the research described in the paper "PROTREC: A probability-based approach for recovering missing proteins based on biological networks" [1]. Proteins in each cell line were acquired by two different data acquisition methods. The first was Data Dependent Acquisition-Parallel Accumulation Serial Fragmentation (DDA-PASEF) and the second was Parallel Accumulation-Serial Fragmentation combined with data-independent acquisition (diaPASEF) [2], [3]. Protein assembly was performed following search against the Swiss-Prot Human database using Peaks Studio for DDA datasets and Spectronaut for DIA datasets. The assembled result contains identified PSMs, peptides and proteins that are above threshold for each HeLa and SiHa sample. Coverage-wise, for DDA-PASEF, approximately 6,090 and 7,298 proteins were quantified for HeLa and SiHa samples, while 13,339 and 8,773 proteins were quantified by diaPASEF for HeLa and SiHa samples, respectively. Consistency-wise, diaPASEF has fewer missing values (∼2%) compared to its DDA counterparts (∼5-7%). The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the iProX partner repository [4] with the dataset identifier PXD029773., Competing Interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper., (© 2022 The Author(s). Published by Elsevier Inc.)
Published: 2022
Full Text: View/download PDF
