PySpark and RDKit: Moving towards Big Data in Cheminformatics
- Publication Year :
- 2019
Abstract
- The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, accessed through PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations, requiring only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark-RDKit implementation is provided. The use cases show that Spark scales reasonably well, to a degree that depends on the use case, and can be a suitable choice for datasets too big to be processed on current low-end workstations.
- Subjects :
- Big Data
Source code
Workstation
Computer science
media_common.quotation_subject
Big data
Datasets as Topic
computer.software_genre
01 natural sciences
law.invention
QSAR
Hadoop
Apache Spark
Python
pandas
03 medical and health sciences
Structural Biology
law
Drug Discovery
030304 developmental biology
media_common
computer.programming_language
0303 health sciences
Distributed Computing Environment
Database
business.industry
Cheminformatics
Organic Chemistry
Python (programming language)
0104 chemical sciences
Computer Science Applications
010404 medicinal & biomolecular chemistry
Analytics
Scalability
Molecular Medicine
business
computer
Software
Details
- Language :
- English
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....c4fbb5fd3edb3da90f10902a102e4636