201. SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
- Author
- Xinhao Li and Denis Fourches
- Subjects
Vocabulary, Simplified molecular-input line-entry system, Computer science, General Chemical Engineering, media_common.quotation_subject, Quantitative Structure-Activity Relationship, Library and Information Sciences, 01 natural sciences, Deep Learning, Encoding (memory), 0103 physical sciences, Humans, media_common, computer.programming_language, 010304 chemical physics, business.industry, Deep learning, Cheminformatics, General Chemistry, computer.file_format, Python (programming language), chEMBL, Substring, 0104 chemical sciences, Computer Science Applications, 010404 medicinal & biomolecular chemistry, Tokenization (data security), Artificial intelligence, business, Algorithm, computer, Algorithms
- Abstract
Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.
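The two stages described in the abstract (learn a vocabulary of frequent substrings, then tokenize with it) can be illustrated with a minimal byte-pair-encoding-style sketch. This is not the SmilesPE implementation: the regular expression, the function names (`atom_tokenize`, `learn_merges`, `spe_tokenize`), the `min_frequency` threshold, and the three-molecule toy corpus are all assumptions made for illustration, and the atom-level pattern is deliberately simplified.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption, not the SmilesPE pre-tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Stage 1: repeatedly merge the most frequent adjacent token pair."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are too rare to be worth a vocabulary entry
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Stage 2: tokenize a new SMILES by replaying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy corpus standing in for a large dataset such as ChEMBL.
toy_corpus = ["CCO", "CCN", "CCC"]
toy_merges = learn_merges(toy_corpus, num_merges=10)
print(toy_merges)
print(spe_tokenize("CCO", toy_merges))
```

Because merged tokens are plain concatenations of atom-level tokens, joining the output of `spe_tokenize` always reproduces the input string, which is what keeps the substring tokens human-readable.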
- Published
- 2020