Back to Search Start Over

Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction

Authors :
Yeh, Ying-Ting
Ock, Janghoon
Farimani, Amir Barati
Publication Year :
2025

Abstract

In this study, we explore the use of a transformer-based language model as an encoder to predict the band gaps of semiconductor materials directly from their text descriptions. Quantum chemistry simulations, including Density Functional Theory (DFT), are computationally intensive and time-consuming, which limits their practicality for high-throughput material screening, particularly for complex systems. Shallow machine learning (ML) models, while effective, often require extensive data preprocessing to convert non-numerical material properties into numerical inputs. In contrast, our approach leverages textual data directly, bypassing the need for complex feature engineering. We generate material descriptions in two formats: formatted strings combining features and natural language text generated using the ChatGPT API. We demonstrate that the RoBERTa model, pre-trained on natural language processing tasks, performs effectively as an encoder for prediction tasks. With minimal fine-tuning, it achieves a mean absolute error (MAE) of approximately 0.33 eV, performing better than shallow machine learning models such as Support Vector Regression, Random Forest, and XGBoost. Even when only the linear regression head is trained while keeping the RoBERTa encoder layers frozen, the accuracy remains nearly identical to that of the fully trained model. This demonstrates that the pre-trained RoBERTa encoder is highly adaptable for processing domain-specific text related to material properties, such as the band gap, significantly reducing the need for extensive retraining. This study highlights the potential of transformer-based language models to serve as efficient and versatile encoders for semiconductor materials property prediction tasks.

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2501.03456
Document Type :
Working Paper