Author: "Sheridan, Robert P." / Topic: quantitative structure-activity relationship - Searchworks@Jio Institute Digital Library Search Results

1. Stability of Prediction in Production ADMET Models as a Function of Version: Why and When Predictions Change.

Author: Sheridan RP
Subjects: Quantitative Structure-Activity Relationship
Abstract: As with other pharma companies, we maintain production QSAR models of ADMET end points and update them regularly. Here, for six ADMET end points, we examine the predictions of test set molecules on multiple versions of random forest models spanning a period of 10 years. For any given end point, the predictions for the majority of molecules are similar for all model versions. However, for a small minority of molecules, the prediction shifts substantially over the span of a few versions. For most molecules that shift, the prediction becomes more accurate at later times. This Perspective investigates metrics that can help indicate which molecules will shift substantially in prediction and when the shift will occur.
Published: 2022
Full Text: View/download PDF

2. Prediction Accuracy of Production ADMET Models as a Function of Version: Activity Cliffs Rule.

Author: Sheridan RP, Culberson JC, Joshi E, Tudor M, and Karnachi P
Subjects: Quantitative Structure-Activity Relationship
Abstract: As with many other institutions, our company maintains many quantitative structure-activity relationship (QSAR) models of absorption, distribution, metabolism, excretion, and toxicity (ADMET) end points and updates the models regularly. We recently examined version-to-version predictivity for these models over a period of 10 years. In this approach we monitor the goodness of prediction of new molecules relative to the training set of model version V before they are incorporated in the updated model V+1. Using a cell-based permeability assay (Papp) as an example, we illustrate how the QSAR models made from this data are generally predictive and can be utilized to enrich chemical designs and synthesis. Despite the obvious utility of these models, we turned up unexpected behavior in Papp and other ADMET activities for which the explanation is not obvious. One such behavior is that the apparent predictivity of the models as measured by root-mean-square-error can vary greatly from version to version and is sometimes very poor. One intuitively appealing explanation is that the observed activities of the new molecules fall outside the bulk of activities in the training set. Alternatively, one may think that the new molecules are exploring different regions of chemical space than the training set. However, the real explanation has to do with activity cliffs. If the observed activities of the new molecules are different than expected based on similar molecules in the training set, the predictions will be less accurate. This is true for all our ADMET end points.
Published: 2022
Full Text: View/download PDF

3. Nearest Neighbor Gaussian Process for Quantitative Structure-Activity Relationships.

Author: DiFranzo A, Sheridan RP, Liaw A, and Tudor M
Subjects: Cluster Analysis, Normal Distribution, Drug Discovery, Quantitative Structure-Activity Relationship
Abstract: While Gaussian process models are typically restricted to smaller data sets, we propose a variation which extends its applicability to the larger data sets common in the industrial drug discovery space, making it relatively novel in the quantitative structure-activity relationship (QSAR) field. By incorporating locality-sensitive hashing for fast nearest neighbor searches, the nearest neighbor Gaussian process model makes predictions with time complexity that is sub-linear with the sample size. The model can be efficiently built, permitting rapid updates to prevent degradation as new data is collected. Given its small number of hyperparameters, it is robust against overfitting and generalizes about as well as other common QSAR models. Like the usual Gaussian process model, it natively produces principled and well-calibrated uncertainty estimates on its predictions. We compare this new model with implementations of random forest, light gradient boosting, and k-nearest neighbors to highlight these promising advantages. The code for the nearest neighbor Gaussian process is available at https://github.com/Merck/nngp.
Published: 2020
Full Text: View/download PDF

4. Experimental Error, Kurtosis, Activity Cliffs, and Methodology: What Limits the Predictivity of Quantitative Structure-Activity Relationship Models?

Author: Sheridan RP, Karnachi P, Tudor M, Xu Y, Liaw A, Shah F, Cheng AC, Joshi E, Glick M, and Alvarez J
Subjects: Structure-Activity Relationship, Uncertainty, Quantitative Structure-Activity Relationship, Scientific Experimental Error
Abstract: Given a particular descriptor/method combination, some quantitative structure-activity relationship (QSAR) datasets are very predictive by random-split cross-validation while others are not. Recent literature in modelability suggests that the limiting issue for predictivity is in the data, not the QSAR methodology, and the limits are due to activity cliffs. Here, we investigate, on in-house data, the relative usefulness of experimental error, distribution of the activities, and activity cliff metrics in determining how predictive a dataset is likely to be. We include unmodified in-house datasets, datasets that should be perfectly predictive based only on the chemical structure, datasets where the distribution of activities is manipulated, and datasets that include a known amount of added noise. We find that activity cliff metrics determine predictivity better than the other metrics we investigated, whatever the type of dataset, consistent with the modelability literature. However, such metrics cannot distinguish real activity cliffs due to large uncertainties in the activities. We also show that a number of modern QSAR methods, and some alternative descriptors, are equally bad at predicting the activities of compounds on activity cliffs, consistent with the assumptions behind "modelability." Finally, we relate time-split predictivity with random-split predictivity and show that different coverages of chemical space are at least as important as uncertainty in activity and/or activity cliffs in limiting predictivity.
Published: 2020
Full Text: View/download PDF

5. Building Quantitative Structure-Activity Relationship Models Using Bayesian Additive Regression Trees.

Author: Feng D, Svetnik V, Liaw A, Pratola M, and Sheridan RP
Subjects: Algorithms, Bayes Theorem, Machine Learning, Models, Chemical, Pharmaceutical Preparations chemistry, Regression Analysis, Small Molecule Libraries chemistry, Drug Discovery methods, Quantitative Structure-Activity Relationship
Abstract: Quantitative structure-activity relationship (QSAR) is a very commonly used technique for predicting the biological activity of a molecule using information contained in the molecular descriptors. The large number of compounds and descriptors and the sparseness of descriptors pose important challenges to traditional statistical methods and machine learning (ML) algorithms (such as random forest (RF)) used in this field. Recently, Bayesian Additive Regression Trees (BART), a flexible Bayesian nonparametric regression approach, has been demonstrated to be competitive with widely used ML approaches. Instead of only focusing on accurate point estimation, BART is formulated entirely in a hierarchical Bayesian modeling framework, allowing one to also quantify uncertainties and hence to provide both point and interval estimation for a variety of quantities of interest. We studied BART as a model builder for QSAR and demonstrated that the approach tends to have predictive performance comparable to RF. More importantly, we investigated BART's natural capability to analyze truncated (or qualified) data, generate interval estimates for molecular activities as well as descriptor importance, and conduct model diagnosis, which could not be easily handled through other approaches.
Published: 2019
Full Text: View/download PDF

6. Interpretation of QSAR Models by Coloring Atoms According to Changes in Predicted Activity: How Robust Is It?

Author: Sheridan RP
Subjects: Humans, Machine Learning, Workflow, Computer Simulation, Quantitative Structure-Activity Relationship
Abstract: Most chemists would agree that the ability to interpret a quantitative structure-activity relationship (QSAR) model is as important as the ability of the model to make accurate predictions. One type of interpretation is coloration of atoms in molecules according to the contribution of each atom to the predicted activity, as in "heat maps". The ability to determine which parts of a molecule increase the activity in question and which decrease it should be useful to chemists who want to modify the molecule. For that type of application, we would hope the coloration to not be particularly sensitive to the details of model building. In this Article, we examine a number of aspects of coloration against 20 combinations of descriptors and QSAR methods. We demonstrate that atom-level coloration is much less robust to descriptor/method combinations than cross-validated predictions. Even in ideal cases where the contribution of individual atoms is known, we cannot always recover the important atoms for some descriptor/method combinations. Thus, model interpretation by atom coloration may not be as simple as it first appeared.
Published: 2019
Full Text: View/download PDF

7. Demystifying Multitask Deep Neural Networks for Quantitative Structure-Activity Relationships.

Author: Xu Y, Ma J, Liaw A, Sheridan RP, and Svetnik V
Subjects: Artificial Intelligence, Computer Simulation, Drug Delivery Systems, Models, Chemical, Neural Networks, Computer, Proteins chemistry, Quantitative Structure-Activity Relationship
Abstract: Deep neural networks (DNNs) are complex computational models that have found great success in many artificial intelligence applications, such as computer vision and natural language processing. In the past four years, DNNs have also generated promising results for quantitative structure-activity relationship (QSAR) tasks. Previous work showed that DNNs can routinely make better predictions than traditional methods, such as random forests, on a diverse collection of QSAR data sets. It was also found that multitask DNN models, those trained on and predicting multiple QSAR properties simultaneously, outperform DNNs trained separately on the individual data sets in many, but not all, tasks. To date there has been no satisfactory explanation of why the QSAR of one task embedded in a multitask DNN can borrow information from other unrelated QSAR tasks. Thus, using multitask DNNs in a way that consistently provides a predictive advantage becomes a challenge. In this work, we explored why multitask DNNs make a difference in predictive performance. Our results show that during prediction a multitask DNN does borrow "signal" from molecules with similar structures in the training sets of the other tasks. However, whether this borrowing leads to better or worse predictive performance depends on whether the activities are correlated. On the basis of this, we have developed a strategy to use multitask DNNs that incorporate prior domain knowledge to select training sets with correlated activities, and we demonstrate its effectiveness on several examples.
Published: 2017
Full Text: View/download PDF

8. Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships.

Author: Sheridan RP, Wang WM, Liaw A, Ma J, and Gifford EM
Subjects: Algorithms, Databases, Pharmaceutical, Drug Discovery, Humans, Models, Biological, Software, Quantitative Structure-Activity Relationship
Abstract: In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
Published: 2016
Full Text: View/download PDF

9. Debunking the Idea that Ligand Efficiency Indices Are Superior to pIC50 as QSAR Activities.

Author: Sheridan RP
Subjects: Humans, Inhibitory Concentration 50, Ligands, Quantitative Structure-Activity Relationship
Abstract: Several papers have appeared in which a ligand efficiency index instead of pIC50 is used as the activity in QSAR. The claim is that better fits and predictions are obtained with ligand efficiency. We show on both public-domain and in-house data sets that the apparent superiority is a statistical artifact that occurs when ligand efficiency indices are correlated with the physical property included in their definition (number of non-hydrogens, ALOGP, TPSA, etc.) and when the property is easier to predict than the original pIC50.
Published: 2016
Full Text: View/download PDF
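
The artifact described in record 9 can be illustrated with a toy simulation. This is a minimal sketch on synthetic data, not the paper's analysis or data sets. It assumes a simplified ligand efficiency of pIC50 divided by the number of non-hydrogen atoms, and it draws pIC50 independently of molecular size so that pIC50 carries no size information at all. The function names `pearson` and `simulate`, the value ranges, and the seed are hypothetical choices for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_molecules=2000, seed=0):
    """Compare how strongly pIC50 and ligand efficiency track molecular size."""
    rng = random.Random(seed)
    # Assumed ranges: 15-50 non-hydrogen atoms, pIC50 between 5 and 9.
    heavy_atoms = [rng.randint(15, 50) for _ in range(n_molecules)]
    # pIC50 is drawn independently of size, so size cannot predict it.
    pic50 = [rng.uniform(5.0, 9.0) for _ in range(n_molecules)]
    # Simplified ligand efficiency: activity per non-hydrogen atom.
    ligand_efficiency = [p / n for p, n in zip(pic50, heavy_atoms)]
    inverse_size = [1.0 / n for n in heavy_atoms]
    return {
        "pIC50_vs_size": pearson(pic50, inverse_size),
        "LE_vs_size": pearson(ligand_efficiency, inverse_size),
    }


result = simulate()
print(result)
```

In this sketch the efficiency index correlates strongly with inverse size even though the underlying pIC50 does not, so any model that can count atoms would appear to "predict" the index well. That is the kind of apparent superiority the abstract attributes to the property built into the index definition.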

10. The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity.

Author: Sheridan RP
Subjects: Databases, Pharmaceutical, Models, Statistical, Uncertainty, Informatics methods, Quantitative Structure-Activity Relationship
Abstract: In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities (an "activity model"). The aim of the field of domain applicability (DA) is to estimate the uncertainty of prediction of a specific molecule on a specific activity model. A number of DA metrics have been proposed in the literature for this purpose. A quantitative model of the prediction uncertainty (an "error model") can be built using one or more of these metrics. A previous publication from our laboratory (Sheridan, R. P. J. Chem. Inf. Model. 2013, 53, 2837-2850) suggested that QSAR methods such as random forest could be used to build error models by fitting unsigned prediction errors against DA metrics. The QSAR paradigm contains two useful techniques: descriptor importance can determine which DA metrics are most useful, and cross-validation can be used to tell which subset of DA metrics is sufficient to estimate the unsigned errors. Previously we studied 10 large, diverse data sets and seven DA metrics. For those data sets for which it is possible to build a significant error model from those seven metrics, only two metrics were sufficient to account for almost all of the information in the error model. These were TREE_SD (the variation of prediction among random forest trees) and PREDICTED (the predicted activity itself). In this paper we show that when data sets are less diverse, as for example in QSAR models of molecules in a single chemical series, these two DA metrics become less important in explaining prediction error, and the DA metric SIMILARITYNEAREST1 (the similarity of the molecule being predicted to the closest training set compound) becomes more important. Our recommendation is that when the mean pairwise similarity (measured with the Carhart AP descriptor and the Dice similarity index) within a QSAR training set is less than 0.5, one can use only TREE_SD, PREDICTED to form the error model, but otherwise one should use TREE_SD, PREDICTED, SIMILARITYNEAREST1.
Published: 2015
Full Text: View/download PDF
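
The recommendation at the end of record 10 reduces to a threshold rule, which can be sketched as follows. This is an illustrative sketch only: the descriptors here are toy count dictionaries standing in for Carhart atom-pair descriptors, the count-based form of the Dice index is an assumption about how the similarity is computed, and the function names are hypothetical. Only the 0.5 threshold and the three metric names come from the abstract.

```python
from itertools import combinations


def dice_similarity(a, b):
    """Dice similarity between two descriptor-count dictionaries (assumed form)."""
    shared = sum(min(count, b.get(key, 0)) for key, count in a.items())
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total else 0.0


def mean_pairwise_similarity(descriptors):
    """Average Dice similarity over all distinct pairs in the training set."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(dice_similarity(a, b) for a, b in pairs) / len(pairs)


def select_da_metrics(descriptors, threshold=0.5):
    """Pick the DA metrics for the error model from training-set diversity."""
    metrics = ["TREE_SD", "PREDICTED"]
    # Less diverse sets (mean similarity at or above the threshold) add the
    # nearest-neighbor similarity metric, per the abstract's recommendation.
    if mean_pairwise_similarity(descriptors) >= threshold:
        metrics.append("SIMILARITYNEAREST1")
    return metrics


# Toy descriptor sets: keys are made-up feature labels, values are counts.
diverse_set = [{"a": 1}, {"b": 1}, {"c": 1}]
single_series = [{"a": 2, "b": 1}, {"a": 2, "b": 1}, {"a": 2, "c": 1}]

print(select_da_metrics(diverse_set))
print(select_da_metrics(single_series))
```

Computing all pairs is quadratic in the number of molecules; for a large training set one would presumably estimate the mean from a random sample of pairs instead.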

11. eCounterscreening: using QSAR predictions to prioritize testing for off-target activities and setting the balance between benefit and risk.

Author: Sheridan RP, McMasters DR, Voigt JH, and Wildey MJ
Subjects: Algorithms, Data Mining, Databases, Factual, Drug Discovery, Drug-Related Side Effects and Adverse Reactions, Models, Chemical, Predictive Value of Tests, Risk Assessment, High-Throughput Screening Assays methods, Quantitative Structure-Activity Relationship
Abstract: During drug development, compounds are tested against counterscreens, a panel of off-target activities that would be undesirable for a drug to have. Testing every compound against every counterscreen is generally too costly in terms of time and money, and we need to find a rational way of prioritizing counterscreen testing. Here we present the eCounterscreening paradigm, wherein predictions from QSAR models for counterscreen activity are used to generate a recommendation as to whether a specific compound in a specific project should be tested against a specific counterscreen. The rules behind the recommendations, which can be summarized in a risk-benefit plot specific for a counterscreen/project combination, are based on a previously assembled database of prospective QSAR predictions. The recommendations require two user-defined cutoffs: the level of activity in a specific counterscreen that is considered undesirable and the level of risk the chemist is willing to accept that an undesired counterscreen activity will go undetected. We demonstrate in a simulated prospective experiment that eCounterscreening can be used to postpone a large fraction of counterscreen testing and still have an acceptably low risk of undetected counterscreen activity.
Published: 2015
Full Text: View/download PDF

12. Deep neural nets as a method for quantitative structure-activity relationships.

Author: Ma J, Sheridan RP, Liaw A, Dahl GE, and Svetnik V
Subjects: Algorithms, Drug Discovery, Machine Learning, Prospective Studies, Support Vector Machine, Workflow, Neural Networks, Computer, Quantitative Structure-Activity Relationship
Abstract: Neural networks were widely used for quantitative structure-activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting, etc.), they were superseded by more robust methods like support vector machine (SVM) and random forest (RF), which arose in the early 2000s. The last 10 years has witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, more efficient training algorithms, and advancements in computer hardware. In particular, deep neural nets (DNNs), i.e. neural nets with more than one hidden layer, have found great successes in many applications, such as computer vision and natural language processing. Here we show that DNNs can routinely make better prospective predictions than RF on a set of large diverse QSAR data sets that are taken from Merck's drug discovery effort. The number of adjustable parameters needed for DNNs is fairly large, but our results show that it is not necessary to optimize them for individual data sets, and a single set of recommended parameters can achieve better performance than RF for most of the data sets we studied. The usefulness of the parameters is demonstrated on additional data sets not used in the calibration. Although training DNNs is still computationally intensive, using graphical processing units (GPUs) can make this issue manageable.
Published: 2015
Full Text: View/download PDF

13. Global quantitative structure-activity relationship models vs selected local models as predictors of off-target activities for project compounds.

Author: Sheridan RP
Subjects: Calibration, Models, Chemical, Quantitative Structure-Activity Relationship
Abstract: In the pharmaceutical industry, it is common for large numbers of compounds to be tested for off-target activities. Given a compound synthesized for an on-target project P, what is the best way to predict its off-target activity X? Is it better to use a global quantitative structure-activity relationship (QSAR) model calibrated against all compounds tested for X, or is it better to use a local model for X calibrated against only the set of compounds in project P? The literature is not consistent on this topic, and strong claims have been made for either. One particular idea is that local models will be superior to global models in prospective prediction if one generates many local models and chooses the type of local model that best predicts recent data. We tested this idea via simulated prospective prediction using in-house data involving compounds in 11 projects tested for 9 off-target activities. In our hands, the local model that best predicts the recent past is seldom the local model that is best at predicting the immediate future. Also, the local model that best predicts the recent past is not systematically better than the global model. This means the complexity of having project- or series-specific models for X can be avoided; a single global model for X is sufficient. We suggest that the relative predictivity of global vs local models may depend on the type of chemical descriptor used. Finally, we speculate why, contrary to observation, intuition suggests local models should be superior to global models.
Published: 2014
Full Text: View/download PDF

14. Three useful dimensions for domain applicability in QSAR models using random forest.

Author: Sheridan RP
Subjects: Discriminant Analysis, Reproducibility of Results, Decision Trees, Quantitative Structure-Activity Relationship
Abstract: One popular metric for estimating the accuracy of prospective quantitative structure-activity relationship (QSAR) predictions is based on the similarity of the compound being predicted to compounds in the training set from which the QSAR model was built. More recent work in the field has indicated that other parameters might be equally or more important than similarity. Here we make use of two additional parameters: the variation of prediction among random forest trees (less variation among trees indicates more accurate prediction) and the prediction itself (certain ranges of activity are intrinsically easier to predict than others). The accuracy of prediction for a QSAR model, as measured by the root-mean-square error, can be estimated by cross-validation on the training set at the time of model-building and stored as a three-dimensional array of bins. This is an obvious extension of the one-dimensional array of bins we previously proposed for similarity to the training set [Sheridan et al. J. Chem. Inf. Comput. Sci. 2004, 44, 1912-1928]. We show that using these three parameters simultaneously adds much more discrimination in prediction accuracy than any single parameter. This approach can be applied to any QSAR method that produces an ensemble of models. We also show that the root-mean-square errors produced by cross-validation are predictive of root-mean-square errors of compounds tested after the model was built. (© 2012 American Chemical Society)
Published: 2012
Full Text: View/download PDF
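
The three-dimensional array of bins described in record 14 might look something like the sketch below. It is a rough illustration under stated assumptions, not the paper's implementation: the bin edges are invented, the bins are stored sparsely in a dictionary keyed by a triple of bin indices rather than in a dense array, and the function names are hypothetical. Each cross-validation record is assumed to carry the similarity to the training set, the standard deviation among trees, the predicted activity, and the observed activity.

```python
import math
from collections import defaultdict


def bin_index(value, edges):
    """Index of the bin holding value, given sorted interior bin edges."""
    return sum(value >= edge for edge in edges)


def bin_key(similarity, tree_sd, predicted, sim_edges, sd_edges, pred_edges):
    """Map the three parameters to a (similarity, tree SD, prediction) bin."""
    return (
        bin_index(similarity, sim_edges),
        bin_index(tree_sd, sd_edges),
        bin_index(predicted, pred_edges),
    )


def build_rmse_bins(records, sim_edges, sd_edges, pred_edges):
    """Root-mean-square error per bin from cross-validated predictions."""
    squared_errors = defaultdict(list)
    for similarity, tree_sd, predicted, observed in records:
        key = bin_key(similarity, tree_sd, predicted,
                      sim_edges, sd_edges, pred_edges)
        squared_errors[key].append((predicted - observed) ** 2)
    return {key: math.sqrt(sum(errs) / len(errs))
            for key, errs in squared_errors.items()}


def estimate_rmse(bins, similarity, tree_sd, predicted,
                  sim_edges, sd_edges, pred_edges):
    """Look up the expected error for a new prediction; None if bin is empty."""
    key = bin_key(similarity, tree_sd, predicted,
                  sim_edges, sd_edges, pred_edges)
    return bins.get(key)


# Made-up cross-validation records: (similarity, tree_sd, predicted, observed).
cv_records = [
    (0.9, 0.1, 5.0, 5.0),
    (0.9, 0.1, 5.0, 7.0),
    (0.2, 1.0, 8.0, 5.0),
]
SIM_EDGES, SD_EDGES, PRED_EDGES = [0.5], [0.5], [6.0]  # assumed edges

bins = build_rmse_bins(cv_records, SIM_EDGES, SD_EDGES, PRED_EDGES)
expected_error = estimate_rmse(bins, 0.8, 0.2, 5.5,
                               SIM_EDGES, SD_EDGES, PRED_EDGES)
print(expected_error)
```

A sparse dictionary makes empty bins explicit (the lookup returns None), which seems a reasonable choice when many of the three-way combinations are never populated by the training set.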

15. Comparison of random forest and Pipeline Pilot Naïve Bayes in prospective QSAR predictions.

Author: Chen B, Sheridan RP, Hornak V, and Voigt JH
Subjects: Bayes Theorem, Enzyme Inhibitors chemistry, Enzyme Inhibitors pharmacology, Humans, Time Factors, Decision Trees, Quantitative Structure-Activity Relationship
Abstract: Random forest is currently considered one of the best QSAR methods available in terms of accuracy of prediction. However, it is computationally intensive. Naïve Bayes is a simple, robust classification method. The Laplacian-modified Naïve Bayes implementation is the preferred QSAR method in the widely used commercial chemoinformatics platform Pipeline Pilot. We made a comparison of the ability of Pipeline Pilot Naïve Bayes (PLPNB) and random forest to make accurate predictions on 18 large, diverse in-house QSAR data sets. These include on-target and ADME-related activities. These data sets were set up as classification problems with either binary or multicategory activities. We used a time-split method of dividing training and test sets, as we feel this is a realistic way of simulating prospective prediction. PLPNB is computationally efficient. However, random forest predictions are at least as good and in many cases significantly better than those of PLPNB on our data sets. PLPNB performs better with ECFP4 and ECFP6 descriptors, which are native to Pipeline Pilot, and more poorly with other descriptors we tried. (© 2012 American Chemical Society)
Published: 2012
Full Text: View/download PDF
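
The time-split mentioned in record 15 can be sketched in a few lines. This is a generic illustration, assuming each record carries a test date; the 75% training fraction, the record layout, and the function name `time_split` are hypothetical and are not taken from the paper.

```python
from datetime import date


def time_split(records, train_fraction=0.75):
    """Split (test_date, molecule_id) records so the newest form the test set."""
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up assay records, deliberately out of chronological order.
assay_records = [
    (date(2011, 3, 1), "mol-C"),
    (date(2010, 1, 15), "mol-A"),
    (date(2011, 9, 30), "mol-D"),
    (date(2010, 6, 2), "mol-B"),
]

train, test = time_split(assay_records)
print([mol for _, mol in train], [mol for _, mol in test])
```

Unlike a random split, every test molecule here was measured after every training molecule, which is what makes the evaluation resemble prospective prediction.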

16. QSAR models for predicting the similarity in binding profiles for pairs of protein kinases and the variation of models between experimental data sets.

Author: Sheridan RP, Nam K, Maiorov VN, McMasters DR, and Cornell WD
Subjects: Binding Sites, Humans, Models, Molecular, Protein Binding, Protein Kinase Inhibitors chemistry, Protein Kinases chemistry, Protein Kinase Inhibitors metabolism, Protein Kinases metabolism, Quantitative Structure-Activity Relationship
Abstract: We propose a direct QSAR methodology to predict how similar the inhibitor-binding profiles of two protein kinases are likely to be, based on the properties of the residues surrounding the ATP-binding site. We produce a random forest model for each of five data sets (one in-house, four from the literature) where multiple compounds are tested on many kinases. Each model is self-consistent by cross-validation, and all models point to only a few residues in the active site controlling the binding profiles. While all models include the "gatekeeper" as one of the important residues, consistent with previous literature, some models suggest other residues as being more important. We apply each model to predict the similarity in binding profile to all pairs in a set of 411 kinases from the human genome and get very different predictions from each model. This turns out not to be an issue with model-building but with the fact that the experimental data sets disagree about which kinases are similar to which others. It is possible to build a model combining all the data from the five data sets that is reasonably self-consistent but not surprisingly, given the disagreement between data sets, less self-consistent than the individual models.
Published: 2009
Full Text: View/download PDF

17. Why do we need so many chemical similarity search methods?

Author: Sheridan RP and Kearsley SK
Subjects: Decision Support Techniques, Drug Design, Humans, Molecular Conformation, Molecular Structure, Drug Evaluation, Preclinical methods, Quantitative Structure-Activity Relationship
Abstract: Computational tools to search chemical structure databases are essential to finding leads early in a drug discovery project. Similarity methods are among the most diverse and most useful. We will present some lessons we have gathered over many years' experience with in-house methods on several therapeutic problems. The effectiveness of any similarity method can vary greatly from one biological activity to another in a way that is difficult to predict. Also, any two methods tend to select different subsets of actives from a database, so it is advisable to use several search methods where possible.
Published: 2002
Full Text: View/download PDF

17 results on '"Sheridan, Robert P."'
