Pu, Qiumei, Li, Yinghao, Zhang, Hong, Yao, Haodong, Zhang, Bo, Hou, Bingji, Li, Lin, Zhao, Yuliang, and Zhao, Lina
In view of huge search space in drug design, machine learning has become a powerful method to predict the affinity between small molecular drug and targeting protein with the development of artificial intelligence technology. However, various machine learning algorithms including massive different parameters make the prediction framework choice to be quite difficult. In this work, we took a recent drug design competition (from XtalPi company on the DataCastle platform) as the typical case to find the optimized parameters for different machines learning algorithms and the most effective algorithm. After the parameter optimizations, we compared the typical machine learning methods as decision tree (XGBoost, LightGBM) and artificial neural network (MLP, CNN) with root-mean-square error (RMSE) and coefficient of determination (R2) evaluation. As a result, decision tree is more effective than the neural network as LightGBM>XGBoost>CNN>MLP in the affinity prediction of the specific drug design problem with ~160000 samples. For a much larger screening task in a more complicated drug design study, the sophisticated neural network model may go beyond the decision tree algorithm after generalization enhancing and overfitting reducing. The advanced machine learning methods could extract more information of protein-ligand bindings than traditional ones and improve the screen efficiency of drug design up to 200–1000 times.