Recent advances in large language models (LLMs) have transformed research and discovery across scientific disciplines, including materials science. The discovery of novel materials, particularly crystal materials, is essential for achieving the Sustainable Development Goals (SDGs), because such materials drive breakthroughs in climate change mitigation, clean and affordable energy, and industrial innovation. However, unlocking the full potential of LLMs in materials research remains challenging due to the lack of high-quality, diverse, instruction-based datasets. Such datasets are crucial for guiding these models to understand and predict the structure, properties, and function of materials across various tasks. To address this limitation, we introduce Mat-Instruction, a large-scale inorganic material instruction dataset designed specifically for LLMs in materials science. Extensive experiments on fine-tuning LLaMA with Mat-Instruction demonstrate its effectiveness in advancing materials science.
The construction of the Mat-Instruction dataset consists of five main steps.
Mat-Instruction comprises a total of 349,090 instructions: 183,654 property prediction instructions, 51,027 crystal structure prediction (CSP) instructions, 35,675 crystal retrosynthesis instructions, 20,502 crystal reaction instructions, 7,205 crystal description instructions, and 51,027 description-guided crystal design instructions.
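As a quick consistency check on the composition above, the per-task counts can be summed and turned into shares of the whole. The snippet below is only an illustrative sketch: the dictionary keys and the helper name `task_shares` are chosen here for readability and are not identifiers from the dataset or its release; only the numbers come from the text.

```python
# Per-task instruction counts, copied from the dataset description above.
# The task labels are illustrative keys, not official field names.
TASK_COUNTS = {
    "property prediction": 183_654,
    "crystal structure prediction (CSP)": 51_027,
    "crystal retrosynthesis": 35_675,
    "crystal reaction": 20_502,
    "crystal description": 7_205,
    "description-guided crystal design": 51_027,
}


def task_shares(counts):
    """Return each task's fraction of the total instruction count."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


TOTAL = sum(TASK_COUNTS.values())
SHARES = task_shares(TASK_COUNTS)

if __name__ == "__main__":
    print(f"total instructions: {TOTAL:,}")
    # List tasks from largest to smallest share.
    for task, share in sorted(SHARES.items(), key=lambda kv: -kv[1]):
        print(f"{task:<36} {TASK_COUNTS[task]:>8,}  {share:6.1%}")
```

The six counts add up to the stated total of 349,090, and property prediction accounts for a little over half of all instructions.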
We conduct extensive experiments on fine-tuning LLaMA with the Mat-Instruction dataset. The results demonstrate the dataset's effectiveness in advancing materials science.
@inproceedings{Mat-Instruct,
title={Mat-Instructions: A Large-Scale Inorganic Material Instruction Dataset for Large Language Models},
author={Ke Liu and Shangde Gao and Yichao Fu and Xiaoliang Wu and Shuo Tong and Ajitha Rajan},
booktitle={IJCAI 2025},
year={2025},
}