51. LDA Topic Modeling for Bioinformatics Terms in arXiv Documents.
- Author
- Karnyoto, Andrea Stevens; Henry, Matthew Martianus; and Pardamean, Bens
- Subjects
- MATHEMATICAL physics, MOLECULAR biology, COMPUTER science, AMINO acids, NUCLEOTIDES
- Abstract
- A wide variety of disciplines contribute to bioinformatics research, including computer science, biology, chemistry, mathematics, and physics. This study uses Latent Dirichlet Allocation (LDA) topic modeling to determine how many research articles published on arXiv fall under bioinformatics topics and which bioinformatics terms are used most frequently. LDA discovers topics hidden within large collections of documents through statistical analysis. Our research examined 226,453 articles on arXiv between January 2023 and January 2024. Of these, more than 10,521 articles were categorized into bioinformatics topics. The largest share, 6,352 documents, is in the "Mathematical Physics" category. The second most popular category is "Computer Science," with 2,950 documents. The terms 'RNA,' 'sequence,' 'tree,' and 'homology' are the most commonly used terms in bioinformatics. The study of RNA plays a vital role in molecular biology; thus, it is prevalent in bioinformatics. Sequential data refer to the order in which nucleotides or amino acids occur in a DNA molecule or a protein. [ABSTRACT FROM AUTHOR]
- Published
- 2024
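
The abstract describes LDA as a statistical method for discovering hidden topics in a document collection. The record does not say which LDA implementation or inference method the authors used, so the following is only a minimal sketch of one common approach, collapsed Gibbs sampling, in plain Python. The function name `lda_gibbs`, the hyperparameter values, and the tiny toy corpus are all assumptions made for illustration; they are not taken from the study.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling (illustrative sketch)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    # Most frequent words assigned to each topic.
    top_terms = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]
    return theta, top_terms


# Toy corpus (made up for illustration, not data from the study).
docs = [
    "rna sequence homology rna structure".split(),
    "sequence alignment tree homology rna".split(),
    "quantum field operator spectrum field".split(),
    "operator spectrum quantum lattice field".split(),
]
theta, top_terms = lda_gibbs(docs, num_topics=2)
for k, terms in enumerate(top_terms):
    print(f"topic {k}: {terms}")
```

In a workflow like the one the abstract outlines, each row of `theta` could be used to decide whether a document belongs to a bioinformatics-like topic, and `top_terms` to list the most frequent terms per topic. A corpus of arXiv scale would more plausibly be handled with an established library than with a hand-written sampler.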