132 results on '"Varnek, A."'
Search Results
2. Meta-GTM: Visualization and Analysis of the Chemical Library Space
- Author
-
Pikalyova, Regina, primary, Zabolotna, Yuliana, additional, Horvath, Dragos, additional, Marcou, Gilles, additional, and Varnek, Alexandre, additional
- Published
- 2023
3. GENERA: A Combined Genetic/Deep-Learning Algorithm for Multiobjective Target-Oriented De Novo Design
- Author
-
Lamanna, Giuseppe, primary, Delre, Pietro, additional, Marcou, Gilles, additional, Saviano, Michele, additional, Varnek, Alexandre, additional, Horvath, Dragos, additional, and Mangiatordi, Giuseppe Felice, additional
- Published
- 2023
4. Inverse QSAR: Reversing Descriptor-Driven Prediction Pipeline Using Attention-Based Conditional Variational Autoencoder
- Author
-
William Bort, Daniyar Mazitov, Dragos Horvath, Fanny Bonachera, Arkadii Lin, Gilles Marcou, Igor Baskin, Timur Madzhidov, and Alexandre Varnek
- Subjects
Molecular Docking Simulation; General Chemical Engineering; Quantitative Structure-Activity Relationship; General Chemistry; Library and Information Sciences; Computer Science Applications
- Abstract
In order to better formalize it, the notorious inverse-QSAR problem (finding structures of given QSAR-predicted properties) is considered in this paper as a two-step process including (i) finding "seed" descriptor vectors corresponding to user-constrained QSAR model output values and (ii) identifying the chemical structures best matching the "seed" vectors. The main development effort here was focused on the latter stage, proposing a new attention-based conditional variational autoencoder neural-network architecture based on recent developments in attention-based methods. The obtained results show that this workflow was capable of generating compounds predicted to display desired activity while being completely novel compared to the training database (ChEMBL). Moreover, the generated compounds show acceptable druglikeness and synthetic accessibility. Both pharmacophore and docking studies were carried out as "orthogonal"
- Published
- 2022
5. Chemspace Atlas: Multiscale Chemography of Ultralarge Libraries for Drug Discovery
- Author
-
Yuliana Zabolotna, Fanny Bonachera, Dragos Horvath, Arkadii Lin, Gilles Marcou, Olga Klimchuk, and Alexandre Varnek
- Subjects
Small Molecule Libraries; Zinc; General Chemical Engineering; Drug Discovery; DNA; General Chemistry; Library and Information Sciences; Gene Library; Computer Science Applications
- Abstract
Nowadays, drug discovery is inevitably intertwined with the usage of large compound collections. Understanding of their chemotype composition and physicochemical property profiles is of the highest importance for successful hit identification. Efficient polyfunctional tools allowing multifaceted analysis of constantly growing chemical libraries must be Big Data-compatible. Here, we present the freely accessible ChemSpace Atlas (https://chematlas.chimie.unistra.fr), which includes almost 40K hierarchically organized Generative Topographic Maps (GTM) accommodating up to 500 M compounds covering fragment-like, lead-like, drug-like, PPI-like, and NP-like chemical subspaces. They allow users to navigate and analyze ZINC, ChEMBL, and COCONUT from multiple perspectives on different scales: from a bird's eye view of the entire library to structural pattern detection in small clusters. Around 20 physicochemical properties and almost 750 biological activities can be visualized (associated with map zones), supporting activity profiling and analogue search. Moreover, ChemSpace Atlas will be extended toward new chemical subspaces (e.g., DNA-encoded libraries and synthons) and functionalities (ADMETox profiling and property-guided de novo compound generation).
- Published
- 2022
6. CGRdb2.0: A Python Database Management System for Molecules, Reactions, and Chemical Data
- Author
-
Timur R. Gimadiev, Pavel Sidorov, Aigul Khakimova, R. I. Nugmanov, Timur I. Madzhidov, Alexandre Varnek, and Adeliya Fatykhova
- Subjects
SQL; Similarity (geometry); Databases, Factual; Database; Syntax (programming languages); Computer science; General Chemical Engineering; Chemical data; General Chemistry; Library and Information Sciences; Python (programming language); computer.software_genre; Computer Science Applications; Benchmarking; Database Management Systems; Graph (abstract data type); Molecule; computer; computer.programming_language
- Abstract
This work introduces CGRdb2.0, an open-source database management system for molecules, reactions, and chemical data. CGRdb2.0 is a Python package connecting to a PostgreSQL database that enables native searches for molecules and reactions without complicated SQL syntax. The library provides out-of-the-box implementations for similarity and substructure searches for molecules, as well as similarity and substructure searches for reactions in two ways: based on reaction components and based on the Condensed Graph of Reaction approach, the latter significantly accelerating the performance. In benchmarking studies with the RDKit database cartridge, we demonstrate that CGRdb2.0 performs searches faster for smaller data sets, while allowing for interactive access to the retrieved data.
- Published
- 2021
7. Inverse QSAR: Reversing Descriptor-Driven Prediction Pipeline Using Attention-Based Conditional Variational Autoencoder
- Author
-
Bort, William, primary, Mazitov, Daniyar, additional, Horvath, Dragos, additional, Bonachera, Fanny, additional, Lin, Arkadii, additional, Marcou, Gilles, additional, Baskin, Igor, additional, Madzhidov, Timur, additional, and Varnek, Alexandre, additional
- Published
- 2022
8. HyFactor: A Novel Open-Source, Graph-Based Architecture for Chemical Structure Generation
- Author
-
Tagir Akhmetshin, Arkadii Lin, Daniyar Mazitov, Yuliana Zabolotna, Evgenii Ziaikin, Timur Madzhidov, and Alexandre Varnek
- Subjects
General Chemical Engineering; General Chemistry; Library and Information Sciences; Software; Computer Science Applications
- Abstract
Graph-based architectures are becoming increasingly popular as a tool for structure generation. Here, we introduce a novel open-source architecture, HyFactor, in which, similar to the InChI linear notation, the number of hydrogens attached to the heavy atoms is considered instead of the bond types. HyFactor was benchmarked on the ZINC 250K, MOSES, and ChEMBL data sets against the conventional graph-based architecture ReFactor, which is our implementation of the DEFactor architecture reported in the literature. On average, HyFactor models contain some 20% fewer fitting parameters than those of ReFactor. The two architectures display similar validity, uniqueness, and reconstruction rates. Compared to the training set compounds, HyFactor generates more similar structures than ReFactor. This could be explained by the fact that the latter generates many open-chain analogues of cyclic structures in the training set. It has been demonstrated that the reconstruction error of heavy molecules can be significantly reduced using the data augmentation technique. The codes of HyFactor and ReFactor as well as all models obtained in this study are publicly available from our GitHub repository: https://github.com/Laboratoire-de-Chemoinformatique/HyFactor.
- Published
- 2022
9. Chemspace Atlas: Multiscale Chemography of Ultralarge Libraries for Drug Discovery
- Author
-
Zabolotna, Yuliana, primary, Bonachera, Fanny, additional, Horvath, Dragos, additional, Lin, Arkadii, additional, Marcou, Gilles, additional, Klimchuk, Olga, additional, and Varnek, Alexandre, additional
- Published
- 2022
10. HyFactor: A Novel Open-Source, Graph-Based Architecture for Chemical Structure Generation
- Author
-
Akhmetshin, Tagir, primary, Lin, Arkadii, additional, Mazitov, Daniyar, additional, Zabolotna, Yuliana, additional, Ziaikin, Evgenii, additional, Madzhidov, Timur, additional, and Varnek, Alexandre, additional
- Published
- 2022
11. Chemography: Searching for Hidden Treasures
- Author
-
Alexandre Varnek, Dmitriy M. Volochnyuk, Dragos Horvath, Gilles Marcou, Arkadii Lin, Yuliana Zabolotna, Chimie de la matière complexe (CMC), Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Chémoinformatique, Institute of Organic Chemistry of NASU [Kyiv], and National Academy of Sciences of Ukraine (NASU)
- Subjects
Engineering; 010304 chemical physics; business.industry; Chemistry, Pharmaceutical; General Chemical Engineering; General Chemistry; Library and Information Sciences; 01 natural sciences; 0104 chemical sciences; Computer Science Applications; World Wide Web; 010404 medicinal & biomolecular chemistry; 0103 physical sciences; business; [CHIM.CHEM]Chemical Sciences/Cheminformatics
- Abstract
The days when medicinal chemistry was limited to a few series of compounds of therapeutic interest are long gone. Nowadays, no human can acquire a complete overview of more than a billion existing or feasible compounds within which the potential “blockbuster drugs” are well hidden and yet only a few mouse clicks away. To reach these “hidden treasures”, we adapted the generative topographic mapping method to enable efficient navigation through the chemical space, from a global overview to structural pattern detection, covering, for the first time, the complete ZINC library of purchasable compounds, relative to 1.6 million biologically relevant ChEMBL molecules. About 40 000 hierarchical maps of the chemical space were constructed. Structural motifs inherent to only one library were identified. Roughly 20 000 off-market ChEMBL compound families represent incentives to enrich commercial catalogs. Alternatively, 125 000 ZINC-specific compound classes, absent in structure–activity bases, are novel paths to explore in medicinal chemistry. The complete list of these chemotypes can be downloaded using the link https://forms.gle/B6bUJj82t9EfmttV6.
- Published
- 2020
12. 'Big Data' Fast Chemoinformatics Model to Predict Generalized Born Radius and Solvent Accessibility as a Function of Geometry
- Author
-
Gilles Marcou, Alexandre Varnek, Dragos Horvath, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Multilinear map; 010304 chemical physics; Cheminformatics; General Chemical Engineering; Proteins; Estimator; Geometry; General Chemistry; Function (mathematics); Library and Information Sciences; 010402 general chemistry; 01 natural sciences; Measure (mathematics); Linear function; 0104 chemical sciences; Computer Science Applications; Set (abstract data type); Radius; 0103 physical sciences; Solvents; Thermodynamics; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Linear equation; Mathematics; Applicability domain
- Abstract
The Generalized Born (GB) solvent model offers the best accuracy/computing effort ratio, yet requires drastic simplifications to estimate the Effective Born Radii (EBR) by bypassing an overly expensive volume integration step. EBRs are a measure of the degree of burial of an atom and are not very sensitive to small changes of geometry: in molecular dynamics, the costly EBR update procedure is not mandatory at every step. This work, however, aims at implementing a GB model into the Sampler for Multiple Protein−Ligand Entities (S4MPLE) evolutionary algorithm, where EBR updates are mandatory at each step, since steps trigger arbitrarily large geometric changes. Therefore, a quantitative structure−property relationship has been developed in order to express the EBRs as a linear function of both the topological neighborhood and geometric occupancy of the space around atoms. A training set of 810 molecular systems, starting from fragment-like to drug-like compounds, proteins, host−guest systems, and ligand−protein complexes, has been compiled. For each species, S4MPLE generated several hundreds of random conformers. For each atom in each geometry of each species, its "standard" EBR was calculated by numeric integration and associated to topological and geometric descriptors of the atom neighborhood. This training set (EBR, atom descriptors) involving >5 M entries was subjected to a bootstrapping multilinear regression process with descriptor selection. In parallel, the strategy was repurposed to also learn atomic solvent-accessible areas (SA) based on the same descriptors. Resulting linear equations were challenged to predict EBR and SA values for a similarly compiled external set of >2000 new molecular systems. Solvation energies calculated with estimated EBR and SA match "standard" energies within the typical error of a force-field-based approach (a few kilocalories per mole). Given the extreme diversity of molecular systems covered by the model, this simple EBR/SA estimator covers a vast applicability domain.
- Published
- 2020
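The abstract of entry 12 describes fitting a property as a linear function of descriptors through bootstrapped multilinear regression. The sketch below illustrates that general idea only: it resamples the training rows with replacement, fits ordinary least squares on each resample, and averages the coefficients. The function names (`fit_ols`, `bootstrap_ols`), the averaging step, and the toy data are assumptions made for illustration; the descriptor-selection stage mentioned in the abstract is not reproduced, and nothing here is taken from the authors' implementation.

```python
import random

def fit_ols(X, y):
    """Least-squares fit y ~ b0 + sum_k b_k * x_k via the normal equations."""
    rows = [[1.0] + list(x) for x in X]          # prepend intercept column
    p = len(rows[0])
    # Augmented system [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        if abs(M[c][c]) < 1e-12:
            raise ValueError("singular design matrix")
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]

def bootstrap_ols(X, y, n_rounds=200, seed=0):
    """Average OLS coefficients over bootstrap resamples of the rows."""
    rng = random.Random(seed)
    n = len(y)
    fits = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        try:
            fits.append(fit_ols([X[i] for i in idx], [y[i] for i in idx]))
        except ValueError:
            continue                                  # skip degenerate resamples
    p = len(fits[0])
    return [sum(f[k] for f in fits) / len(fits) for k in range(p)]

# Toy data (hypothetical): two descriptors, noise-free linear target.
X_demo = [(float(i % 4), float(i // 4)) for i in range(12)]
y_demo = [1.0 + 2.0 * a - 3.0 * b for a, b in X_demo]
print(bootstrap_ols(X_demo, y_demo, n_rounds=50))
```

On noise-free data every non-degenerate resample recovers the same coefficients; with real data the spread of the per-resample fits gives a rough indication of coefficient stability.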
13. A Close-up Look at the Chemical Space of Commercially Available Building Blocks for Medicinal Chemistry
- Author
-
Alexandre Varnek, Gilles Marcou, Olexandr Oksiuta, Kostiantyn Gavrylenko, Dragos Horvath, Yuliana Zabolotna, Dmitriy M. Volochnyuk, Sergey V. Ryabukhin, and Yurii S. Moroz
- Subjects
Anions; Library design; Drug discovery; Computer science; General Chemical Engineering; Chemistry, Pharmaceutical; Organic synthesis; Medicinal chemistry; General Chemistry; Library and Information Sciences; chEMBL; Chemical space; Computer Science Applications; Market fragmentation; Set (abstract data type); Reagents,Chemical reactions; Generative topographic map; Drug Discovery; Molecule; Indicators and Reagents
- Abstract
The ability to efficiently synthesize desired compounds can be a limiting factor for chemical space exploration in drug discovery. This ability is conditioned not only by the existence of well-studied synthetic protocols but also by the availability of corresponding reagents, so-called building blocks (BB). In this work, we present a detailed analysis of the chemical space of 400K purchasable BB. The chemical space was defined by corresponding synthons – fragments contributed to the final molecules upon reaction. They allow an analysis of BB physicochemical properties and diversity, unbiased by the leaving and protective groups in actual reagents. The main classes of BB were analyzed in terms of their availability, rule-of-two-defined quality, and diversity. Available BBs were eventually compared to a reference set of biologically relevant synthons derived from ChEMBL fragmentation, in order to illustrate how well they cover the actual medicinal chemistry needs. This was performed on a newly constructed universal generative topographic map of synthon chemical space that enables visualization of both libraries and analysis of their overlapping and library-specific regions.
- Published
- 2021
14. QSAR Modeling Based on Conformation Ensembles Using a Multi-Instance Learning Approach
- Author
-
Aleksandra Nikonenko, Dmitry V. Zankov, R. I. Nugmanov, Alexandre Varnek, Pavel G. Polishchuk, Igor I. Baskin, Mariia Matveieva, and Timur I. Madzhidov
- Subjects
Quantitative structure–activity relationship; Databases, Factual; Computer science; business.industry; General Chemical Engineering; Bioactive molecules; Deep learning; Molecular Conformation; Quantitative Structure-Activity Relationship; Pattern recognition; General Chemistry; Library and Information Sciences; 3d descriptors; Computer Science Applications; chemistry.chemical_compound; chemistry; Drug Discovery; Molecular graph; Artificial intelligence; business; Algorithms
- Abstract
Modern QSAR approaches have wide practical applications in drug discovery for designing potentially bioactive molecules. If such models are based on the use of 2D descriptors, important information contained in the spatial structures of molecules is lost. The major problem in constructing models using 3D descriptors is the choice of a putative bioactive conformation, which affects the predictive performance. The multi-instance (MI) learning approach considering multiple conformations in model training could be a reasonable solution to the above problem. In this study, we implemented several multi-instance algorithms, both conventional and based on deep learning, and investigated their performance. We compared the performance of MI-QSAR models with those based on the classical single-instance QSAR (SI-QSAR) approach in which each molecule is encoded by either 2D descriptors computed for the corresponding molecular graph or 3D descriptors issued for a single lowest energy conformation. The calculations were carried out on 175 data sets extracted from the ChEMBL23 database. It is demonstrated that (i) MI-QSAR outperforms SI-QSAR in numerous cases and (ii) MI algorithms can automatically identify plausible bioactive conformations.
- Published
- 2021
15. A Close-up Look at the Chemical Space of Commercially Available Building Blocks for Medicinal Chemistry
- Author
-
Zabolotna, Yuliana, primary, Volochnyuk, Dmitriy M., additional, Ryabukhin, Sergey V., additional, Horvath, Dragos, additional, Gavrilenko, Konstantin S., additional, Marcou, Gilles, additional, Moroz, Yurii S., additional, Oksiuta, Oleksandr, additional, and Varnek, Alexandre, additional
- Published
- 2021
16. CGRdb2.0: A Python Database Management System for Molecules, Reactions, and Chemical Data
- Author
-
Gimadiev, Timur, primary, Nugmanov, Ramil, additional, Khakimova, Aigul, additional, Fatykhova, Adeliya, additional, Madzhidov, Timur, additional, Sidorov, Pavel, additional, and Varnek, Alexandre, additional
- Published
- 2021
17. SynthI: A New Open-Source Tool for Synthon-Based Library Design
- Author
-
Zabolotna, Yuliana, primary, Volochnyuk, Dmitriy M., additional, Ryabukhin, Sergey V., additional, Gavrylenko, Kostiantyn, additional, Horvath, Dragos, additional, Klimchuk, Olga, additional, Oksiuta, Oleksandr, additional, Marcou, Gilles, additional, and Varnek, Alexandre, additional
- Published
- 2021
18. CGRtools: Python Library for Molecule, Reaction, and Condensed Graph of Reaction Processing
- Author
-
Alexandre Varnek, R. I. Nugmanov, Timur R. Gimadiev, Timur I. Madzhidov, Tagir Akhmetshin, Valentina A. Afonina, Ravil Mukhametgaleev, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Chemical Phenomena; 010304 chemical physics; Programming language; business.industry; Computer science; Cheminformatics; General Chemical Engineering; General Chemistry; Library and Information Sciences; Python (programming language); computer.software_genre; 01 natural sciences; 0104 chemical sciences; Computer Science Applications; Small Molecule Libraries; 010404 medicinal & biomolecular chemistry; Software; Models, Chemical; 0103 physical sciences; business; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics; computer.programming_language
- Abstract
CGRtools is an open-source Python library for handling molecular and reaction information. It is the only library developed so far that can process condensed graphs of reaction (CGR). CGR enables advanced operations on reaction information and can be used for reaction descriptor calculation, structure-reactivity modeling, atom-to-atom mapping comparison and correction, reaction center extraction, reaction balancing, and other related tasks. Unlike other popular libraries, CGRtools is written entirely in Python, has only minor dependencies on other libraries, and is cross-platform. Reaction, molecule, and CGR objects in CGRtools support native Python methods and are comparable with the help of operations "equal to", "less than", and "bigger than". CGRtools supports common structural formats. CGRtools is distributed under an LGPL license and is available at https://github.com/cimm-kzn/CGRtools.
- Published
- 2019
19. QSAR Modeling Based on Conformation Ensembles Using a Multi-Instance Learning Approach
- Author
-
Zankov, Dmitry V., primary, Matveieva, Mariia, additional, Nikonenko, Aleksandra V., additional, Nugmanov, Ramil I., additional, Baskin, Igor I., additional, Varnek, Alexandre, additional, Polishchuk, Pavel, additional, and Madzhidov, Timur I., additional
- Published
- 2021
20. Trustworthiness, the Key to Grid-Based Map-Driven Predictive Model Enhancement and Applicability Domain Control
- Author
-
Dragos Horvath, Alexandre Varnek, Gilles Marcou, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Quantitative structure–activity relationship; 010405 organic chemistry; Computer science; General Chemical Engineering; General Chemistry; Library and Information Sciences; Grid; computer.software_genre; 01 natural sciences; Regression; 0104 chemical sciences; Computer Science Applications; 010404 medicinal & biomolecular chemistry; Molecular descriptor; Coherence (signal processing); Covariant transformation; Data mining; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Algorithms; Parametric statistics; Applicability domain
- Abstract
In chemography, grid-based maps sample molecular descriptor space by injecting a set of nodes, and then linking them to some regular 2D grid representing the map. They include self-organizing maps (SOMs) and generative topographic maps (GTMs). Grid-based maps are predictive because any compound thereupon projected can "inherit" the properties of its residence node(s), node properties themselves being "inherited" from node-neighboring training set compounds. This Article proposes a formalism to define the trustworthiness of these nodes as "providers" of structure-activity information captured from training compounds. An empirical four-parameter node trustworthiness (NT) function of density (sparsely populated nodes are less trustworthy) and coherence (nodes with training set residents of divergent properties are less trustworthy) is proposed. Based upon it, a trustworthiness score T is used to delimit the applicability domain (AD) by means of a trustworthiness threshold TT. For each parameter setup, success of ensuing inside-AD predictions is monitored. It is seen that setup-specific success levels (averaged over large pools of prediction challenges) are highly covariant, irrespective of the targets of prediction challenges, of the (classification or regression) type of problems, of the specific parametrization, and even of the nature (GTM or SOM) of underlying maps. Thus, success levels determined on the basis of regression problems (445 target-specific affinity QSAR sets) on GTMs and levels returned by completely unrelated classification problems (319 target-specific active-/inactive-labeled sets) on SOMs were seen to correlate to a degree of 70%. Therefore, a common, general-purpose setup of the herein proposed parametric AD definition was shown to generally apply to grid-based map-driven property prediction problems.
- Published
- 2020
21. Combined Graph/Relational Database Management System for Calculated Chemical Reaction Pathway Data
- Author
-
Gimadiev, Timur, primary, Nugmanov, Ramil, additional, Batyrshin, Dinar, additional, Madzhidov, Timur, additional, Maeda, Satoshi, additional, Sidorov, Pavel, additional, and Varnek, Alexandre, additional
- Published
- 2021
22. Chemography: Searching for Hidden Treasures
- Author
-
Zabolotna, Yuliana, primary, Lin, Arkadii, additional, Horvath, Dragos, additional, Marcou, Gilles, additional, Volochnyuk, Dmitriy M., additional, and Varnek, Alexandre, additional
- Published
- 2020
23. Trustworthiness, the Key to Grid-Based Map-Driven Predictive Model Enhancement and Applicability Domain Control
- Author
-
Horvath, Dragos, primary, Marcou, Gilles, additional, and Varnek, Alexandre, additional
- Published
- 2020
24. “Big Data” Fast Chemoinformatics Model to Predict Generalized Born Radius and Solvent Accessibility as a Function of Geometry
- Author
-
Horvath, Dragos, primary, Marcou, Gilles, additional, and Varnek, Alexandre, additional
- Published
- 2020
25. Conjugated Quantitative Structure-Property Relationship Models: Application to Simultaneous Prediction of Tautomeric Equilibrium Constants and Acidity of Molecules
- Author
-
Assima Rakhimbekova, R. I. Nugmanov, Igor I. Baskin, Dmitry V. Zankov, Alexandre Varnek, Marina A. Kazymova, Timur I. Madzhidov, Timur R. Gimadiev, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
General Chemical Engineering; Thermodynamics; Quantitative Structure-Activity Relationship; Library and Information Sciences; Conjugated system; 01 natural sciences; Quantitative Structure Property Relationship; 0103 physical sciences; Drug Discovery; Molecule; Organic Chemicals; Equilibrium constant; Mathematical relationship; 010304 chemical physics; Molecular Structure; Chemistry; Stereoisomerism; General Chemistry; Tautomer; 0104 chemical sciences; Computer Science Applications; 010404 medicinal & biomolecular chemistry; Models, Chemical; Pharmaceutical Preparations; Solvents; Neural Networks, Computer; Acids; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Algorithms
- Abstract
Here, we describe a concept of conjugated models for several properties (activities) linked by a strict mathematical relationship. This relationship can be directly integrated analytically into the ridge regression (RR) algorithm or accounted for in a special case of "twin" neural networks (NN). Developed approaches were applied to the modeling of the logarithm of the prototropic tautomeric constant (logK
- Published
- 2019
26. SynthI: A New Open-Source Tool for Synthon-Based Library Design
- Author
-
Zabolotna, Yuliana, Volochnyuk, Dmitriy M., Ryabukhin, Sergey V., Gavrylenko, Kostiantyn, Horvath, Dragos, Klimchuk, Olga, Oksiuta, Oleksandr, Marcou, Gilles, and Varnek, Alexandre
- Abstract
Most of the existing computational tools for de novo library design are focused on the generation, rational selection, and combination of promising structural motifs to form members of the new library. However, the absence of a direct link between the chemical space of the retrosynthetically generated fragments and the pool of available reagents makes such approaches appear as rather theoretical and reality-disconnected. In this context, here we present Synthons Interpreter (SynthI), a new open-source toolkit for de novo library design that allows merging those two chemical spaces into a single synthons space. Here synthons are defined as actual fragments with valid valences and special labels, specifying the position and the nature of reactive centers. They can be issued from either the “breakup” of reference compounds according to 38 retrosynthetic rules or real reagents, after leaving group withdrawal or transformation. Such an approach not only enables the design of synthetically accessible libraries and analog generation but also facilitates reagents (building blocks) analysis in the medicinal chemistry context. SynthI code is publicly available at https://github.com/Laboratoire-de-Chemoinformatique/SynthI.
- Published
- 2022
27. A Close-up Look at the Chemical Space of Commercially Available Building Blocks for Medicinal Chemistry
- Author
-
Zabolotna, Yuliana, Volochnyuk, Dmitriy M., Ryabukhin, Sergey V., Horvath, Dragos, Gavrilenko, Konstantin S., Marcou, Gilles, Moroz, Yurii S., Oksiuta, Oleksandr, and Varnek, Alexandre
- Abstract
The ability to efficiently synthesize desired compounds can be a limiting factor for chemical space exploration in drug discovery. This ability is conditioned not only by the existence of well-studied synthetic protocols but also by the availability of corresponding reagents, so-called building blocks (BBs). In this work, we present a detailed analysis of the chemical space of 400 000 purchasable BBs. The chemical space was defined by corresponding synthons, i.e., fragments contributed to the final molecules upon reaction. They allow an analysis of BB physicochemical properties and diversity, unbiased by the leaving and protective groups in actual reagents. The main classes of BBs were analyzed in terms of their availability, rule-of-two-defined quality, and diversity. Available BBs were eventually compared to a reference set of biologically relevant synthons derived from ChEMBL fragmentation, in order to illustrate how well they cover the actual medicinal chemistry needs. This was performed on a newly constructed universal generative topographic map of synthon chemical space that enables visualization of both libraries and analysis of their overlapped and library-specific regions.
- Published
- 2022
28. CGRdb2.0: A Python Database Management System for Molecules, Reactions, and Chemical Data
- Author
-
Gimadiev, Timur, Nugmanov, Ramil, Khakimova, Aigul, Fatykhova, Adeliya, Madzhidov, Timur, Sidorov, Pavel, and Varnek, Alexandre
- Abstract
This work introduces CGRdb2.0, an open-source database management system for molecules, reactions, and chemical data. CGRdb2.0 is a Python package connecting to a PostgreSQL database that enables native searches for molecules and reactions without complicated SQL syntax. The library provides out-of-the-box implementations for similarity and substructure searches for molecules, as well as similarity and substructure searches for reactions in two ways: based on reaction components and based on the Condensed Graph of Reaction approach, the latter significantly accelerating the performance. In benchmarking studies with the RDKit database cartridge, we demonstrate that CGRdb2.0 performs searches faster for smaller data sets, while allowing for interactive access to the retrieved data.
- Published
- 2022
29. Conjugated Quantitative Structure–Property Relationship Models: Application to Simultaneous Prediction of Tautomeric Equilibrium Constants and Acidity of Molecules
- Author
-
Zankov, Dmitry V., primary, Madzhidov, Timur I., additional, Rakhimbekova, Assima, additional, Gimadiev, Timur R., additional, Nugmanov, Ramil I., additional, Kazymova, Marina A., additional, Baskin, Igor I., additional, and Varnek, Alexandre, additional
- Published
- 2019
30. CGRtools: Python Library for Molecule, Reaction, and Condensed Graph of Reaction Processing
- Author
-
Nugmanov, Ramil I., primary, Mukhametgaleev, Ravil N., additional, Akhmetshin, Tagir, additional, Gimadiev, Timur R., additional, Afonina, Valentina A., additional, Madzhidov, Timur I., additional, and Varnek, Alexandre, additional
- Published
- 2019
31. CovaDOTS: In Silico Chemistry-Driven Tool to Design Covalent Inhibitors Using a Linking Strategy
- Author
-
Hoffer, Laurent, primary, Saez-Ayala, Magali, additional, Horvath, Dragos, additional, Varnek, Alexandre, additional, Morelli, Xavier, additional, and Roche, Philippe, additional
- Published
- 2019
- Full Text
- View/download PDF
32. De Novo Molecular Design by Combining Deep Autoencoder Recurrent Neural Networks with Generative Topographic Mapping
- Author
-
Sattarov, Boris, primary, Baskin, Igor I., additional, Horvath, Dragos, additional, Marcou, Gilles, additional, Bjerrum, Esben Jannik, additional, and Varnek, Alexandre, additional
- Published
- 2019
- Full Text
- View/download PDF
33. Kernel Target Alignment Parameter: A New Modelability Measure for Regression Tasks
- Author
-
Dragos Horvath, Gilles Marcou, Alexandre Varnek, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
0301 basic medicine; Serotonin; Jaccard index; Mean squared error; Computer science; General Chemical Engineering; Quantitative Structure-Activity Relationship; Library and Information Sciences; computer.software_genre; 01 natural sciences; Measure (mathematics); Set (abstract data type); 03 medical and health sciences; Molecular descriptor; Toxicity Tests; Series (mathematics); Tetrahymena pyriformis; business.industry; Pattern recognition; General Chemistry; Models, Theoretical; 0104 chemical sciences; Computer Science Applications; Data set; 010404 medicinal & biomolecular chemistry; 030104 developmental biology; Drug Design; Kernel (statistics); Regression Analysis; Artificial intelligence; Data mining; business; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics
- Abstract
In this paper, we demonstrate that the kernel target alignment (KTA) parameter can efficiently be used to estimate the relevance of molecular descriptors for QSAR modeling on a given data set, i.e., as a modelability measure. The efficiency of KTA to assess modelability was demonstrated in two series of QSAR modeling studies, either varying different descriptor spaces for one same data set, or comparing various data sets within one same descriptor space. Considered data sets included 25 series of various GPCR binders with ChEMBL-reported pKi values, and a toxicity data set. Employed descriptor spaces covered more than 100 different ISIDA fragment descriptor types, and ChemAxon BCUT terms. Model performances (RMSE) were seen to anticorrelate consistently with the KTA parameter. Two other modelability measures were employed for benchmarking purposes: the Jaccard distance average over the data set (Div), and a measure related to the normalized mean absolute error (MAE) obtained in 1-nearest neighbors calculations on the training set (Sim = 1 - MAE). It has been demonstrated that both Div and Sim perform similarly to KTA. However, a consensus index combining KTA, Div and Sim provides a more robust correlation with RMSE than any of the individual modelability measures.
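The alignment idea in this abstract can be illustrated with a minimal sketch. It uses the standard textbook definition of kernel-target alignment (the normalized Frobenius inner product between a kernel matrix and the target matrix y yᵀ), a plain linear kernel, and mean-centered targets. These choices, and the function names, are assumptions made for illustration; the abstract does not specify the exact formulation used in the paper.

```python
import math


def linear_kernel(X):
    """Gram matrix of dot products between descriptor vectors (rows of X)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def kernel_target_alignment(K, y):
    """Alignment between kernel matrix K and the target kernel y y^T.

    Assumption: y is mean-centered first, a common choice for regression.
    """
    n = len(y)
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]
    # Frobenius inner product <K, yc yc^T>
    inner = sum(K[i][j] * yc[i] * yc[j] for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    t_norm = sum(v * v for v in yc)  # ||yc yc^T||_F equals yc . yc
    if k_norm == 0.0 or t_norm == 0.0:
        return 0.0
    return inner / (k_norm * t_norm)


# A descriptor that tracks the property exactly gives maximal alignment.
informative = linear_kernel([[-1.0], [0.0], [1.0]])
kta_high = kernel_target_alignment(informative, [-1.0, 0.0, 1.0])

# A descriptor unrelated to the property gives zero alignment.
uninformative = linear_kernel([[1.0], [-1.0], [1.0], [-1.0]])
kta_low = kernel_target_alignment(uninformative, [1.0, 1.0, -1.0, -1.0])
```

In this toy case a higher alignment corresponds to a descriptor space in which the property is easier to model, which is the sense in which the abstract reports RMSE anticorrelating with KTA.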
- Published
- 2015
- Full Text
- View/download PDF
34. Stargate GTM: Bridging Descriptor and Activity Spaces
- Author
-
Gilles Marcou, Alexandre Varnek, Igor I. Baskin, Dragos Horvath, Helena Gaspar, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
General Chemical Engineering; Quantitative Structure-Activity Relationship; Multi-task learning; Library and Information Sciences; Weighted geometric mean; Space (mathematics); Machine learning; computer.software_genre; 01 natural sciences; Set (abstract data type); 03 medical and health sciences; Lasso (statistics); Artificial Intelligence; Molecular descriptor; Humans; Probability; 030304 developmental biology; Mathematics; 0303 health sciences; business.industry; Pattern recognition; General Chemistry; 0104 chemical sciences; Computer Science Applications; Random forest; 010404 medicinal & biomolecular chemistry; Drug Design; Computer-Aided Design; Probability distribution; Artificial intelligence; business; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Algorithms
- Abstract
Predicting the activity profile of a molecule or discovering structures possessing a specific activity profile are two important goals in chemoinformatics, which could be achieved by bridging activity and molecular descriptor spaces. In this paper, we introduce the "Stargate" version of the Generative Topographic Mapping approach (S-GTM) in which two different multidimensional spaces (e.g., structural descriptor space and activity space) are linked through a common 2D latent space. In the S-GTM algorithm, the manifolds are trained simultaneously in two initial spaces using the probabilities in the 2D latent space calculated as a weighted geometric mean of probability distributions in both spaces. S-GTM has the following interesting features: (1) activities are involved during the training procedure; therefore, the method is supervised, unlike conventional GTM; (2) using molecular descriptors of a given compound as input, the model predicts a whole activity profile, and (3) using an activity profile as input, areas populated by relevant chemical structures can be detected. To assess the performance of S-GTM prediction models, a descriptor space (ISIDA descriptors) of a set of 1325 GPCR ligands was related to a B-dimensional (B = 1 or 8) activity space corresponding to pKi values for eight different targets. S-GTM outperforms conventional GTM for individual activities and performs similarly to the Lasso multitask learning algorithm, although it is still slightly less accurate than the Random Forest method.
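The coupling step mentioned in this abstract, a weighted geometric mean of the node probability distributions from the two spaces, can be sketched as follows. The function name, the `weight` parameter, and the renormalization step are illustrative assumptions; the abstract does not give the exact expression used in S-GTM.

```python
def combine_node_probabilities(p_descriptor, p_activity, weight=0.5):
    """Weighted geometric mean of two per-node probability distributions.

    `weight` (hypothetical name) is the exponent given to the descriptor-space
    distribution; the activity-space distribution gets 1 - weight.
    The result is renormalized so that it sums to 1 over the latent nodes.
    """
    if len(p_descriptor) != len(p_activity):
        raise ValueError("both distributions must cover the same nodes")
    raw = [
        (pd ** weight) * (pa ** (1.0 - weight))
        for pd, pa in zip(p_descriptor, p_activity)
    ]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("the two distributions have no node in common")
    return [r / total for r in raw]


# Two latent nodes: descriptor space is undecided, activity space prefers node 0.
combined = combine_node_probabilities([0.5, 0.5], [0.9, 0.1], weight=0.5)
```

With equal weights the combined distribution leans toward node 0, showing how information from the activity space can shift the latent-space assignment.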
- Published
- 2015
- Full Text
- View/download PDF
35. Chemical Data Visualization and Analysis with Incremental Generative Topographic Mapping: Big Data Challenge
- Author
-
Igor I. Baskin, Alexandre Varnek, Dragos Horvath, Gilles Marcou, Helena Gaspar, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Similarity (geometry); Computer science; Entropy; General Chemical Engineering; Big data; Library and Information Sciences; computer.software_genre; 01 natural sciences; Small Molecule Libraries; User-Computer Interface; 03 medical and health sciences; Entropy (information theory); Bhattacharyya distance; 030304 developmental biology; 0303 health sciences; business.industry; General Chemistry; Chemical space; 0104 chemical sciences; Computer Science Applications; Visualization; Euclidean distance; 010404 medicinal & biomolecular chemistry; Solubility; Data mining; business; computer; Algorithms; Databases, Chemical; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Chemical database
- Abstract
This paper is devoted to the analysis and visualization in 2-dimensional space of large data sets of millions of compounds using the incremental version of generative topographic mapping (iGTM). The iGTM algorithm implemented in the in-house ISIDA-GTM program was applied to a database of more than 2 million compounds combining data sets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis were proposed. The chemical space coverage was evaluated using the normalized Shannon entropy. Different views of the data (property landscapes) were obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. The superposition of these views helped to identify the regions in the chemical space populated by compounds with desirable physicochemical profiles and the suppliers providing them. Data set similarity in the latent space was assessed by applying several metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data sets were compared by considering them as individual objects on a meta-GTM map, built on cumulated responsibility vectors or property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
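The coverage measure named in this abstract can be sketched with the usual definition of normalized Shannon entropy, computed over a data set's cumulated responsibilities on the map nodes and divided by the logarithm of the node count. The function name and the handling of empty or single-node maps are assumptions for illustration, not details taken from the paper.

```python
import math


def normalized_shannon_entropy(cumulated_responsibilities):
    """Entropy of a data set's distribution over map nodes, scaled to [0, 1].

    Input: one non-negative value per node (the cumulated responsibility).
    1.0 means the set is spread evenly over all nodes; 0.0 means it sits
    in a single node.
    """
    n_nodes = len(cumulated_responsibilities)
    total = sum(cumulated_responsibilities)
    if n_nodes < 2 or total <= 0:
        return 0.0  # assumption: degenerate maps are reported as zero coverage
    probs = [r / total for r in cumulated_responsibilities]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n_nodes)


even_coverage = normalized_shannon_entropy([1.0, 1.0, 1.0, 1.0])
single_node = normalized_shannon_entropy([5.0, 0.0, 0.0, 0.0])
```

A supplier library that populates only a corner of the map would score near 0, while one that spans the whole chemical space would score near 1.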
- Published
- 2014
- Full Text
- View/download PDF
36. Virtual Screening with Generative Topographic Maps: How Many Maps Are Required?
- Author
-
Casciuc, Iuri, primary, Zabolotna, Yuliana, additional, Horvath, Dragos, additional, Marcou, Gilles, additional, Bajorath, Jürgen, additional, and Varnek, Alexandre, additional
- Published
- 2018
- Full Text
- View/download PDF
37. Do Not Hesitate to Use Tversky—and Other Hints for Successful Active Analogue Searches with Feature Count Descriptors
- Author
-
Dragos Horvath, Alexandre Varnek, Gilles Marcou, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Similarity (geometry); Databases, Pharmaceutical; Computer science; General Chemical Engineering; Drug Evaluation, Preclinical; Library and Information Sciences; Machine learning; computer.software_genre; 01 natural sciences; Set (abstract data type); User-Computer Interface; 03 medical and health sciences; Feature (machine learning); Data Mining; 030304 developmental biology; Internet; 0303 health sciences; Virtual screening; business.industry; Pattern recognition; General Chemistry; ChEMBL; 0104 chemical sciences; Computer Science Applications; 010404 medicinal & biomolecular chemistry; Colored; Metric (mathematics); Artificial intelligence; Pharmacophore; business; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Algorithms
- Abstract
This study is an exhaustive analysis of the neighborhood behavior over a large coherent data set (ChEMBL target/ligand pairs of known Ki, for 165 targets with >50 associated ligands each). It focuses on similarity-based virtual screening (SVS) success defined by the ascertained optimality index. This is a weighted compromise between purity and retrieval rate of active hits in the neighborhood of an active query. One key issue addressed here is the impact of Tversky asymmetric weighting of query vs candidate features (represented as integer-value ISIDA colored fragment/pharmacophore triplet count descriptor vectors). Nearly three-quarters of a million independent SVS runs showed that Tversky scores with a strong bias in favor of query-specific features are, by far, the most successful and the least failure-prone out of a set of nine other dissimilarity scores. These include classical Tanimoto, which failed to defend its privileged status in practical SVS applications. Tversky performance is not significantly conditioned by tuning of its bias parameter α. Both initial "guesses" of α = 0.9 and 0.7 were more successful than Tanimoto (which was, in turn, better than Euclid). Tversky was eventually tested in exhaustive similarity searching within the library of 1.6 M commercial + bioactive molecules at http://infochim.u-strasbg.fr/webserv/VSEngine.html , comparing favorably to Tanimoto in terms of "scaffold hopping" propensity. Therefore, it should be used at least as often as, and perhaps in parallel with, Tanimoto in SVS. Analysis with respect to query subclasses highlighted relationships of query complexity (simply expressed in terms of pharmacophore pattern counts) and/or target nature vs SVS success likelihood. SVS using more complex queries are more robust with respect to the choice of their operational premises (descriptors, metric). Yet, they are best handled by "pro-query" Tversky scores at α > 0.5.
Among simpler queries, one may distinguish between "growable" (allowing for active analogs with additional features), and a few "conservative" queries not allowing any growth. These (typically bioactive amine transporter ligands) form the specific application domain of "pro-candidate" biased Tversky scores at α < 0.5.
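The asymmetric weighting discussed in this abstract can be sketched for integer count vectors. The version below uses the general Tversky index with shared counts taken as element-wise minima, weight α on query-only counts and 1 − α on candidate-only counts. Both the count-vector generalization and the β = 1 − α coupling are assumptions for illustration; the paper's exact scoring convention is not given in the abstract.

```python
def _count_overlap(query, candidate):
    """Return (shared, query_only, candidate_only) totals for count vectors."""
    if len(query) != len(candidate):
        raise ValueError("vectors must have the same length")
    shared = sum(min(q, c) for q, c in zip(query, candidate))
    query_only = sum(max(q - c, 0) for q, c in zip(query, candidate))
    candidate_only = sum(max(c - q, 0) for q, c in zip(query, candidate))
    return shared, query_only, candidate_only


def tversky_counts(query, candidate, alpha=0.9):
    """Tversky similarity; alpha > 0.5 penalizes missing query features more."""
    shared, query_only, candidate_only = _count_overlap(query, candidate)
    denom = shared + alpha * query_only + (1.0 - alpha) * candidate_only
    return shared / denom if denom > 0 else 0.0


def tanimoto_counts(query, candidate):
    """Symmetric Tanimoto similarity on the same count vectors, for comparison."""
    shared, query_only, candidate_only = _count_overlap(query, candidate)
    denom = shared + query_only + candidate_only
    return shared / denom if denom > 0 else 0.0


query = [2, 1, 0]      # hypothetical fragment counts of the active query
candidate = [1, 1, 3]  # candidate with several extra features
pro_query = tversky_counts(query, candidate, alpha=0.9)
symmetric = tanimoto_counts(query, candidate)
```

Under this convention a candidate that keeps most of the query's features but adds new ones scores well at α = 0.9 and poorly with Tanimoto, which is one way to read the "growable" query behavior described above.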
- Published
- 2013
- Full Text
- View/download PDF
38. Models for Identification of Erroneous Atom-to-Atom Mapping of Reactions Performed by Automated Algorithms
- Author
-
Gilles Marcou, Alexandre Varnek, João Aires-de-Sousa, Dragos Horvath, Christophe Muller, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Support Vector Machine; Computer science; General Chemical Engineering; Library and Information Sciences; 010402 general chemistry; computer.software_genre; Models, Biological; 01 natural sciences; Automation; 03 medical and health sciences; False Positive Reactions; Databases, Protein; 030304 developmental biology; 0303 health sciences; Training set; Computational Biology; General Chemistry; 0104 chemical sciences; Computer Science Applications; Support vector machine; Automated algorithm; Test set; KEGG database; Graph (abstract data type); Data mining; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics
- Abstract
Machine learning (SVM and JRip rule learner) methods have been used in conjunction with the Condensed Graph of Reaction (CGR) approach to identify errors in the atom-to-atom mapping of chemical reactions produced by an automated mapping tool by ChemAxon. The modeling has been performed on the first three enzymatic classes of metabolic reactions from the KEGG database. Each reaction has been converted into a CGR representing a pseudomolecule with conventional (single, double, aromatic, etc.) bonds and dynamic bonds characterizing chemical transformations. The ChemAxon tool was used to automatically detect the matching atom pairs in reagents and products. These automated mappings were analyzed by a human expert and classified as "correct" or "wrong". ISIDA fragment descriptors generated for CGRs for both correct and wrong mappings were used as attributes in machine learning. The learned models have been validated in n-fold cross-validation on the training set followed by a challenge to detect correct and wrong mappings within an external test set of reactions, never used for learning. Results show that both SVM and JRip models detect most of the wrongly mapped reactions. We believe that this approach could be used to identify erroneous atom-to-atom mapping performed by any automated algorithm.
- Published
- 2012
- Full Text
- View/download PDF
39. Chemography: Searching for Hidden Treasures
- Author
-
Zabolotna, Yuliana, Lin, Arkadii, Horvath, Dragos, Marcou, Gilles, Volochnyuk, Dmitriy M., and Varnek, Alexandre
- Abstract
The days when medicinal chemistry was limited to a few series of compounds of therapeutic interest are long gone. Nowadays, no human can acquire a complete overview of more than a billion existing or feasible compounds within which the potential “blockbuster drugs” are well hidden and yet only a few mouse clicks away. To reach these “hidden treasures”, we adapted the generative topographic mapping method to enable efficient navigation through the chemical space, from a global overview to a structural pattern detection, covering, for the first time, the complete ZINC library of purchasable compounds, relative to 1.6 million biologically relevant ChEMBL molecules. About 40 000 hierarchical maps of the chemical space were constructed. Structural motifs inherent to only one library were identified. Roughly 20 000 off-market ChEMBL compound families represent incentives to enrich commercial catalogs. Alternatively, 125 000 ZINC-specific compound classes, absent in structure–activity bases, are novel paths to explore in medicinal chemistry. The complete list of these chemotypes can be downloaded using the link https://forms.gle/B6bUJj82t9EfmttV6.
- Published
- 2021
- Full Text
- View/download PDF
40. Privileged Structural Motif Detection and Analysis Using Generative Topographic Maps
- Author
-
Kayastha, Shilva, primary, Horvath, Dragos, additional, Gilberg, Erik, additional, Gütschow, Michael, additional, Bajorath, Jürgen, additional, and Varnek, Alexandre, additional
- Published
- 2017
- Full Text
- View/download PDF
41. Benchmarking of Linear and Nonlinear Approaches for Quantitative Structure−Property Relationship Studies of Metal Complexation with Ionophores
- Author
-
Nicolas Lachiche, Frank Hoonakker, Vitaly P. Solov'ev, Xiaojun Yao, Alexandre Varnek, Jean Doucet, Botao Fan, Igor V. Tetko, Piere Jost, Denis Fourches, and Alexey V. Antonov
- Subjects
Quantitative structure–activity relationship; Silver; Ionophores; Wilcoxon signed-rank test; Artificial neural network; Software Validation; General Chemical Engineering; Linear model; Quantitative Structure-Activity Relationship; General Chemistry; Models, Theoretical; Library and Information Sciences; Computer Science Applications; k-nearest neighbors algorithm; Support vector machine; Europium; Nonlinear Dynamics; Molecular descriptor; Statistics; Linear regression; Linear Models; Organometallic Compounds; Biological system; Algorithms; Mathematics
- Abstract
A benchmark of several popular methods, Associative Neural Networks (ANN), Support Vector Machines (SVM), k Nearest Neighbors (kNN), Maximal Margin Linear Programming (MMLP), Radial Basis Function Neural Network (RBFNN), and Multiple Linear Regression (MLR), is reported for quantitative structure-property relationships (QSPR) of stability constants logK1 for the 1:1 (M:L) and logβ2 for 1:2 complexes of metal cations Ag+ and Eu3+ with diverse sets of organic molecules in water at 298 K and ionic strength 0.1 M. The methods were tested on three types of descriptors: molecular descriptors including E-state values, counts of atoms determined for E-state atom types, and substructural molecular fragments (SMF). Comparison of the models was performed using a 5-fold external cross-validation procedure. Robust statistical tests (bootstrap and Kolmogorov-Smirnov statistics) were employed to evaluate the significance of calculated models. The Wilcoxon signed-rank test was used to compare the performance of methods. Individual structure-complexation property models obtained with nonlinear methods demonstrated a significantly better performance than the models built using multilinear regression analysis (MLRA). However, the averaging of several MLRA models based on SMF descriptors provided as good a prediction as the most efficient nonlinear techniques. Support Vector Machines and Associative Neural Networks contributed to the largest number of significant models. Models based on fragments (SMF descriptors and E-state counts) had higher prediction ability than those based on E-state indices. The use of SMF descriptors and E-state counts provided similar results, whereas E-state indices led to less significant models. The current study illustrates the difficulties of quantitative comparison of different methods: conclusions based only on one data set without appropriate statistical tests could be wrong.
- Published
- 2006
- Full Text
- View/download PDF
42. Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set
- Author
-
Robert Körner, Gilles Marcou, Huanxiang Liu, Dragos Horvath, Roberto Todeschini, Phuong Dao, Xiaojun Yao, Douglas M. Young, Paola Gramatica, A. Varnek, A. Artemenko, Todd M. Martin, Anil Kumar Pandey, Farhad Hormozdiari, Eugene N. Muratov, Alexander Tropsha, Christophe Muller, Artem Cherkasov, Tomas Öberg, Katja Hansen, Lili Xi, Timon Schroeter, Pavel G. Polishchuk, Sergii Novotarskyi, Jiazhong Li, Volodymyr V. Prokopenko, Denis Fourches, Victor E. Kuz’min, Cenk Sahinalp, Igor I. Baskin, Klaus-Robert Müller, Igor V. Tetko, Iurii Sushko, Chimie de la matière complexe (CMC), Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), Sushko, I, Novotarskyi, S, Körner, R, Pandey, A, Cherkasov, A, Li, J, Gramatica, P, Hansen, K, Schroeter, T, Müller, K, Xi, L, Liu, H, Yao, X, Öberg, T, Hormozdiari, F, Dao, P, Sahinalp, C, Todeschini, R, Polishchuk, P, Artemenko, A, Kuz'Min, V, Martin, T, Young, D, Fourches, D, Tropsha, A, Baskin, I, Horvath, D, Marcou, G, Varnek, A, Prokopenko, V, and Tetko, I
- Subjects
Quantitative structure–activity relationship; General Chemical Engineering; Quantitative Structure-Activity Relationship; Library and Information Sciences; computer.software_genre; 01 natural sciences; Standard deviation; Set (abstract data type); 03 medical and health sciences; CHIM/01 - CHIMICA ANALITICA; Similarity (network science); 030304 developmental biology; Mathematics; 0303 health sciences; Principal Component Analysis; QSAR; Mutagenicity Tests; mutagenicity; General Chemistry; Classification; 0104 chemical sciences; Computer Science Applications; Ames test; Data set; 010404 medicinal & biomolecular chemistry; Benchmarking; Test set; Metric (mathematics); Data mining; computer; Algorithm; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Applicability domain
- Abstract
The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1 .
- Published
- 2010
- Full Text
- View/download PDF
43. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection
- Author
-
Tomas Öberg, Anil Kumar Pandey, Roberto Todeschini, Denis Fourches, Alexander Tropsha, Alexandre Varnek, Igor V. Tetko, Iurii Sushko, Ester Papa, Hao Zhu, Chimie de la matière complexe (CMC), Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), Tetko, I, Sushko, I, Pandey, A, Zhu, H, Tropsha, A, Papa, E, Oberg, T, Todeschini, R, Fourches, D, and Varnek, A
- Subjects
Quantitative structure–activity relationship; Databases, Factual; QSAR, validation, applicability domain, variable selection; General Chemical Engineering; Normal Distribution; Quantitative Structure-Activity Relationship; Feature selection; Library and Information Sciences; Overfitting; 01 natural sciences; Models, Biological; Standard deviation; 03 medical and health sciences; CHIM/01 - CHIMICA ANALITICA; Predictive Value of Tests; Statistics; Toxicity Tests; Leverage (statistics); Animals; Computer Simulation; 030304 developmental biology; Statistical hypothesis testing; Mathematics; 0303 health sciences; Models, Statistical; Tetrahymena pyriformis; Reproducibility of Results; General Chemistry; 0104 chemical sciences; Computer Science Applications; 010404 medicinal & biomolecular chemistry; Test set; Environmental Pollutants; Biological system; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Applicability domain
- Abstract
The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that defines the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It could be expressed in many different ways, e.g., using Tanimoto coefficient, leverage, correlation in space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with low and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces but not by QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in the wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org site. © 2008 American Chemical Society.
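The best-performing distance to model in this abstract, the standard deviation of predictions across an ensemble of models, can be sketched in a few lines. The use of the sample standard deviation, the threshold-based reliability flag, and all names and numbers below are illustrative assumptions, not the published procedure.

```python
import statistics


def distance_to_model_std(ensemble_predictions):
    """Spread of one compound's predictions across the ensemble members.

    Assumption: the sample standard deviation is used; a larger spread is
    read as a larger distance to the model (less reliable prediction).
    """
    return statistics.stdev(ensemble_predictions)


def flag_reliable(predictions_by_compound, threshold):
    """Mark each compound as reliable if its ensemble spread is within threshold."""
    return {
        name: distance_to_model_std(preds) <= threshold
        for name, preds in predictions_by_compound.items()
    }


# Hypothetical toxicity predictions from three ensemble members.
predictions = {
    "compound_a": [1.0, 1.1, 0.9],  # members agree
    "compound_b": [0.2, 1.5, 2.8],  # members disagree
}
reliability = flag_reliable(predictions, threshold=0.5)
```

The threshold would normally be calibrated on compounds with known errors, so that the spread can be translated into an expected prediction accuracy.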
- Published
- 2008
- Full Text
- View/download PDF
44. Automatized Assessment of Protective Group Reactivity: A Step Toward Big Reaction Data Analysis
- Author
-
Lin, Arkadii I., primary, Madzhidov, Timur I., additional, Klimchuk, Olga, additional, Nugmanov, Ramil I., additional, Antipin, Igor S., additional, and Varnek, Alexandre, additional
- Published
- 2016
- Full Text
- View/download PDF
45. Prediction of Activity Cliffs Using Condensed Graphs of Reaction Representations, Descriptor Recombination, Support Vector Machine Classification, and Support Vector Regression
- Author
-
Horvath, Dragos, primary, Marcou, Gilles, additional, Varnek, Alexandre, additional, Kayastha, Shilva, additional, de la Vega de León, Antonio, additional, and Bajorath, Jürgen, additional
- Published
- 2016
- Full Text
- View/download PDF
46. Structural and Physico-Chemical Interpretation (SPCI) of QSAR Models and Its Comparison with Matched Molecular Pair Analysis
- Author
-
Polishchuk, Pavel, primary, Tinkov, Oleg, additional, Khristova, Tatiana, additional, Ognichenko, Ludmila, additional, Kosinskaya, Anna, additional, Varnek, Alexandre, additional, and Kuz’min, Victor, additional
- Published
- 2016
- Full Text
- View/download PDF
47. Chemical Space Mapping and Structure–Activity Analysis of the ChEMBL Antiviral Compound Set
- Author
-
Klimenko, Kyrylo, primary, Marcou, Gilles, additional, Horvath, Dragos, additional, and Varnek, Alexandre, additional
- Published
- 2016
- Full Text
- View/download PDF
48. Generative topographic mapping-based classification models and their applicability domain: application to the biopharmaceutics Drug Disposition Classification System (BDDCS)
- Author
-
Gilles Marcou, Sylvain Lozano, Philippe Vayer, Helena Gaspar, Alban Arault, Alexandre Varnek, Dragos Horvath, Chimie de la matière complexe (CMC), and Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Prescription Drugs; Computer science; Databases, Pharmaceutical; General Chemical Engineering; Entropy; Image processing; Library and Information Sciences; 01 natural sciences; Biopharmaceutics; 03 medical and health sciences; Data visualization; Molecular descriptor; Entropy (information theory); Humans; 030304 developmental biology; 0303 health sciences; Biological Products; Models, Statistical; Drug disposition; business.industry; Pattern recognition; General Chemistry; Drugs, Investigational; 0104 chemical sciences; Computer Science Applications; 010404 medicinal & biomolecular chemistry; Generative model; Solubility; Artificial intelligence; business; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Algorithms; Software; Applicability domain
- Abstract
Earlier (Kireeva et al. Mol. Inf. 2012, 31, 301–312), we demonstrated that generative topographic mapping (GTM) can be efficiently used both for data visualization and building of classification models in the initial D-dimensional space of molecular descriptors. Here, we describe the modeling in two-dimensional latent space for the four classes of the BioPharmaceutics Drug Disposition Classification System (BDDCS) involving VolSurf descriptors. Three new definitions of the applicability domain (AD) of models have been suggested: one class-independent AD which considers the GTM likelihood and two class-dependent ADs considering respectively, either the predominant class in a given node of the map or informational entropy. The class entropy AD was found to be the most efficient for the BDDCS modeling. The predominant class AD can be directly visualized on GTM maps, which helps the interpretation of the model.
- Published
- 2013
- Full Text
- View/download PDF
49. Predicting ligand binding modes from neural networks trained on protein-ligand interaction fingerprints
- Author
-
Gilles Marcou, Igor I. Baskin, Vladimir Chupakhin, Alexandre Varnek, Didier Rognan, Chimie de la matière complexe (CMC), Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Innovation Thérapeutique (LIT), and Centre National de la Recherche Scientifique (CNRS)-Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)
- Subjects
General Chemical Engineering; Quantitative Structure-Activity Relationship; Library and Information Sciences; Machine learning; computer.software_genre; Crystallography, X-Ray; Ligands; 01 natural sciences; Peptide Mapping; Protein Structure, Secondary; Mitogen-Activated Protein Kinase 14; 03 medical and health sciences; Humans; HSP90 Heat-Shock Proteins; 030304 developmental biology; 0303 health sciences; Virtual screening; Binding Sites; Artificial neural network; Chemistry; business.industry; Cyclin-Dependent Kinase 2; General Chemistry; Ligand (biochemistry); 0104 chemical sciences; Computer Science Applications; Molecular Docking Simulation; 010404 medicinal & biomolecular chemistry; Protein–ligand docking; Docking (molecular); Artificial intelligence; Neural Networks, Computer; Biological system; business; computer; [CHIM.CHEM]Chemical Sciences/Cheminformatics; Protein ligand; Protein Binding
- Abstract
We herewith present a novel approach to predict protein–ligand binding modes from the single two-dimensional structure of the ligand. Known protein–ligand X-ray structures were converted into binary bit strings encoding protein–ligand interactions. An artificial neural network was then set up to first learn and then predict protein–ligand interaction fingerprints from simple ligand descriptors. Specific models were constructed for three targets (CDK2, p38-α, HSP90-α) and 146 ligands for which protein–ligand X-ray structures are available. These models were able to predict protein–ligand interaction fingerprints and to discriminate important features from minor interactions. Predicted interaction fingerprints were successfully used as descriptors to discriminate true ligands from decoys by virtual screening. In some but not all cases, the predicted interaction fingerprints furthermore make it possible to efficiently rerank cross-docking poses and prioritize the best possible docking solutions.
- Published
- 2013
- Full Text
- View/download PDF
50. CovaDOTS: In Silico Chemistry-Driven Tool to Design Covalent Inhibitors Using a Linking Strategy
- Author
-
Hoffer, Laurent, Saez-Ayala, Magali, Horvath, Dragos, Varnek, Alexandre, Morelli, Xavier, and Roche, Philippe
- Abstract
We recently reported an integrated fragment-based optimization strategy called DOTS (Diversity Oriented Target-focused Synthesis) that combines automated virtual screening (VS) with semirobotized organic synthesis coupled to in vitro evaluation. The molecular modeling part consists of hit-to-lead chemistry, based on the growing paradigm. Here, we have extended the applicability of the DOTS strategy by adding new functionalities, allowing a generic chemistry-driven linking approach with a particular emphasis on covalent drugs. Indeed, the covalent mode of action can be described as a specific case of linking, where suitable linkers are sought to fuse a bound organic compound with a nucleophilic protein side chain. The proof of concept is established using three retrospective study cases in which known noncovalent inhibitors have been converted to covalent inhibitors. Our method is able to automatically design reference covalent inhibitors (and/or analogs) from an initial activated substructure and predict their binding mode. More importantly, the reference compounds are ranked high among several hundred putative adducts, demonstrating the utility of the approach to design covalent inhibitors.
- Published
- 2019
- Full Text
- View/download PDF