1. Smart distributed data factory volunteer computing platform for active learning-driven molecular data acquisition
- Author
Tsolak Ghukasyan, Vahagn Altunyan, Aram Bughdaryan, Tigran Aghajanyan, Khachik Smbatyan, Garegin A. Papoian, and Garik Petrosyan
- Subjects
Volunteer computing, Active learning, Conformational energy, Machine learning, Graph neural networks, Medicine, Science
- Abstract
This paper presents the smart distributed data factory (SDDF), an AI-driven distributed computing platform designed to address challenges in drug discovery by creating comprehensive datasets of molecular conformations and their properties. SDDF uses volunteer computing, leveraging the processing power of personal computers worldwide to accelerate quantum chemistry (DFT) calculations. To tackle the vast chemical space and limited high-quality data, SDDF employs an ensemble of machine learning (ML) models to predict molecular properties and selectively choose the most challenging data points for further DFT calculations. The platform also generates new molecular conformations using molecular dynamics with the forces derived from these models. SDDF makes several contributions: the volunteer computing platform for DFT calculations; an active learning framework for constructing a dataset of molecular conformations; a large public dataset of diverse ENAMINE molecules with calculated energies; an ensemble of ML models for accurate energy prediction. The energy dataset was generated to validate the SDDF approach of reducing the need for extensive calculations. With its strict scaffold split, the dataset can be used for training and benchmarking energy models. By combining active learning, distributed computing, and quantum chemistry, SDDF offers a scalable, cost-effective solution for developing accurate molecular models and ultimately accelerating drug discovery.
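The selection step described in the abstract (an ensemble of ML models choosing the "most challenging" data points for DFT) can be illustrated with a minimal sketch. The abstract does not say how a point is judged challenging, so the criterion used here, disagreement among ensemble members measured as the standard deviation of their predictions, is an assumption. The function name `select_for_dft`, the `budget` parameter, and the toy models are hypothetical and are not taken from SDDF.

```python
from statistics import pstdev
from typing import Callable, List, Sequence


def select_for_dft(
    candidates: Sequence[float],
    ensemble: Sequence[Callable[[float], float]],
    budget: int,
) -> List[float]:
    """Return the `budget` candidates on which the ensemble disagrees most.

    Assumption: "challenging" is approximated by the population standard
    deviation of the ensemble's predictions for a candidate.
    """
    def disagreement(x: float) -> float:
        return pstdev([model(x) for model in ensemble])

    # Rank candidates from highest to lowest disagreement, keep the top ones.
    ranked = sorted(candidates, key=disagreement, reverse=True)
    return ranked[:budget]


# Toy stand-ins: three "energy models" and scalar "conformations".
ensemble = [lambda x: 1.0 * x, lambda x: 1.1 * x, lambda x: 0.9 * x]
candidates = [0.5, 4.0, 2.0, 8.0]

selected = select_for_dft(candidates, ensemble, budget=2)
print(selected)
```

In this toy setup the spread between models grows with the input, so the two largest inputs are selected. In an active learning loop, the selected points would be labeled by the expensive reference calculation and added to the training set before the ensemble is retrained.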
- Published
- 2025