Mat-Instructions: A Large-Scale Inorganic Material Instruction Dataset for Large Language Models

1,2Ke Liu ✉, 1Shangde Gao, 1Yichao Fu, 2Xiaoliang Wu, 1Shuo Tong, 2Ajitha Rajan
IJCAI'25
1Zhejiang University 2University of Edinburgh
✉ denotes correspondence
* denotes equal contribution

Abstract

Recent advancements in large language models (LLMs) have revolutionized research discovery across various scientific disciplines, including materials science. The discovery of novel materials, particularly crystal materials, is essential for achieving sustainable development goals (SDGs), as they drive breakthroughs in climate change mitigation, clean and affordable energy, and the promotion of industrial innovation. However, unlocking the full potential of LLMs in materials research remains challenging due to the lack of high-quality, diverse, and instruction-based datasets. Such datasets are crucial for guiding these models in understanding and predicting the structure, property, and function of materials across various tasks. To address this limitation, we introduce Mat-Instruction, a large-scale inorganic material instruction dataset, specifically designed to unlock the potential of LLMs in materials science. Extensive experiments on fine-tuning LLaMA with our Mat-Instruction dataset demonstrate its effectiveness in advancing progress for materials science.

Mat-Instruction Dataset Construction

The construction of the Mat-Instruction dataset consists of five main steps:


Tasks

  • Crystal Structure Prediction Instructions: Given the chemical formula of a material, the LLM is supposed to predict its crystal structure.
  • Property Prediction Instructions: Given a crystal structure, the LLM is supposed to predict its properties, such as band gap, formation energy, and elastic constants.
  • Description-guided Crystal Design Instructions: Given the description of a material, the LLM is supposed to design a crystal structure that meets the requirements specified in the description. The description includes specific properties, space groups, and other constraints.
  • Crystal Reaction Instructions: Given the reactants of a chemical reaction, the LLM is supposed to predict the products of the reaction.
  • Crystal Retrosynthesis Instructions: Given a chemical formula, the LLM is supposed to predict the synthesis pathways, conditions, and outcomes of the material.
  • Crystal Description Instructions: Given a chemical formula, the LLM is supposed to describe the properties and structures of the material.
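
As a rough illustration of how the six task types above could share one record format, here is a minimal Python sketch. The schema is an assumption: the field names (`task`, `instruction`, `input`, `output`), the task identifier strings, and the example record are hypothetical and are not taken from the released dataset.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Task(str, Enum):
    # One member per task type listed above; the string values are hypothetical.
    CSP = "crystal_structure_prediction"
    PROPERTY = "property_prediction"
    DESIGN = "description_guided_crystal_design"
    REACTION = "crystal_reaction"
    RETROSYNTHESIS = "crystal_retrosynthesis"
    DESCRIPTION = "crystal_description"


@dataclass
class InstructionRecord:
    # Hypothetical schema: an instruction, its input, and the expected output.
    task: Task
    instruction: str
    input: str
    output: str

    def to_json(self) -> str:
        """Serialize the record as one JSON line."""
        return json.dumps(
            {
                "task": self.task.value,
                "instruction": self.instruction,
                "input": self.input,
                "output": self.output,
            }
        )


# Illustrative placeholder record, not an actual dataset entry.
example = InstructionRecord(
    task=Task.CSP,
    instruction="Predict the crystal structure of the material with the given chemical formula.",
    input="NaCl",
    output="<crystal structure, e.g. as CIF text>",
)
line = example.to_json()
print(line)
```

Keeping the task type as an explicit field would make it easy to filter or rebalance the tasks before fine-tuning, although the actual dataset may organize its records differently.
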
Statistics of Mat-Instruction

Mat-Instruction comprises a total of 349,090 instructions: 183,654 property prediction instructions, 51,027 crystal structure prediction (CSP) instructions, 35,675 crystal retrosynthesis instructions, 20,502 crystal reaction instructions, 7,205 crystal description instructions, and 51,027 description-guided crystal design instructions.
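
The per-task counts can be checked against the reported total with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the paragraph above, and the dictionary keys are informal labels rather than official task identifiers.

```python
# Per-task instruction counts as reported above.
counts = {
    "property prediction": 183_654,
    "crystal structure prediction (CSP)": 51_027,
    "crystal retrosynthesis": 35_675,
    "crystal reaction": 20_502,
    "crystal description": 7_205,
    "description-guided crystal design": 51_027,
}
REPORTED_TOTAL = 349_090

total = sum(counts.values())
# The six counts should add up to the reported dataset size.
assert total == REPORTED_TOTAL

# Share of each task in the full dataset, largest first.
shares = {task: n / total for task, n in counts.items()}
for task, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{task:<36} {n:>8,} {shares[task]:6.1%}")
```

The counts do sum to 349,090, and property prediction alone accounts for slightly more than half of all instructions.
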


Experiments

We conduct extensive experiments by fine-tuning LLaMA on the Mat-Instruction dataset. The results demonstrate the dataset's effectiveness in advancing materials science.

Experimental Results

BibTeX

@inproceedings{Mat-Instruct,
  title={Mat-Instructions: A Large-Scale Inorganic Material Instruction Dataset for Large Language Models},
  author={Ke Liu and Shangde Gao and Yichao Fu and Xiaoliang Wu and Shuo Tong and Ajitha Rajan},
  booktitle={IJCAI 2025},
  year={2025},
}