Berishvili VP, Perkin VO, Voronkov AE, Radchenko EV, Syed R, Venkata Ramana Reddy C, Pillay V, Kumar P, Choonara YE, Kamal A, and Palyulin VA
Molecular dynamics simulations provide valuable insights into the behavior of molecular systems. Extending the recent trend of using machine learning techniques to predict physicochemical properties from molecular dynamics data, we propose treating the trajectories as multidimensional time series, represented by 2D tensors that contain the ligand-protein interaction descriptor values for each time step. Because they are similar in structure to the time series encountered in modern approaches to signal, speech, and natural language processing, these time series can be analyzed directly using long short-term memory (LSTM) recurrent neural networks or convolutional neural networks (CNNs). Predictive regression models of ligand-protein affinity were built on a subset of the PDBbind v.2017 database and applied to inhibitors of tankyrase, an enzyme of the poly(ADP-ribose)-polymerase (PARP) family that can be used in the treatment of colorectal cancer. A subset of the Community Structure-Activity Resource (CSAR) data set served as an additional test set. For comparison, random forest and simple neural network models based on the crystal pose or the trajectory-averaged descriptors were used, as well as the commonly employed docking and molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) scores. Convolutional neural networks based on the 2D tensors of ligand-protein interaction descriptors for short (2 ns) trajectories provide the best accuracy and predictive power, reaching a Spearman rank correlation coefficient of 0.73 and a Pearson correlation coefficient of 0.70 on the tankyrase test set. Given the recent increase in the computational power of GPUs and the relatively low computational complexity of the proposed approach, it can serve as an advanced virtual screening filter for compound prioritization.
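To make the data layout concrete, the following minimal sketch shows how a trajectory might be held as a 2D tensor (time steps by descriptors), how the trajectory-averaged baseline collapses the time axis, and how the two reported metrics are computed. It is an illustration under stated assumptions, not the authors' implementation: the descriptor values, the hand-set filter, and all function names are hypothetical, and a single fixed filter only stands in for the learned layers of a real CNN.

```python
from statistics import mean


def trajectory_tensor(frames):
    """Stack per-frame descriptor vectors into a (time steps x descriptors) 2D list."""
    width = len(frames[0])
    if any(len(frame) != width for frame in frames):
        raise ValueError("every frame needs the same number of descriptors")
    return [[float(value) for value in frame] for frame in frames]


def time_average(tensor):
    """Collapse the time axis, giving the trajectory-averaged baseline representation."""
    return [mean(column) for column in zip(*tensor)]


def conv1d_time(tensor, kernel):
    """Slide one filter of shape (k x descriptors) along the time axis (no padding)."""
    k = len(kernel)
    return [
        sum(w * x
            for k_row, row in zip(kernel, tensor[t:t + k])
            for w, x in zip(k_row, row))
        for t in range(len(tensor) - k + 1)
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            result[order[position]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy trajectory: 4 time steps, 2 interaction descriptors per step.
tensor = trajectory_tensor([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
averaged = time_average(tensor)
# Hypothetical filter spanning 2 time steps; it responds to changes in descriptor 0.
feature_map = conv1d_time(tensor, [[1.0, 0.0], [-1.0, 0.0]])
```

The time-averaged vector discards the ordering of frames, whereas the convolution output still varies along the time axis, which is the information a CNN or LSTM can exploit. In practice, predicted and experimental affinities would be passed to `pearson` and `spearman` to obtain scores comparable to those quoted above.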