In this work, we find that LLMs progressively develop a core language-agnostic parameter space. This compact yet critical set of parameters underlies the model’s ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system.
Specifically, we identify language-related neurons, i.e., those consistently activated when processing particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one).
As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence.
Motivated by these insights, we propose neuron-specific training strategies tailored to each model's level of language-agnosticism at different stages of development.
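As an illustration of the shared/exclusive categorization above, here is a toy sketch with synthetic activation statistics. It is not the paper's actual detection code, and the 0.95 activation-probability threshold is an assumption for illustration only:

```python
import numpy as np

def categorize_neurons(act_prob, threshold=0.95):
    """Classify neurons from per-language activation probabilities.

    act_prob: dict mapping language -> array of shape (n_neurons,),
    giving the fraction of sentences in which each neuron fires.
    A neuron counts as "language-related" for a language if its
    activation probability exceeds `threshold` (an assumed cutoff,
    not the paper's exact criterion).
    """
    langs = list(act_prob)
    related = {l: set(np.flatnonzero(act_prob[l] > threshold)) for l in langs}
    all_related = set().union(*related.values())
    # Shared: related to more than one language; exclusive: exactly one.
    shared = {n for n in all_related if sum(n in related[l] for l in langs) > 1}
    exclusive = all_related - shared
    return shared, exclusive

# Toy example: 5 neurons, 2 languages.
probs = {
    "en": np.array([0.99, 0.97, 0.10, 0.96, 0.20]),
    "zh": np.array([0.98, 0.10, 0.99, 0.97, 0.30]),
}
shared, exclusive = categorize_neurons(probs)
# shared -> {0, 3}; exclusive -> {1, 2}
```

In this sketch, neurons 0 and 3 fire reliably for both languages (shared), while neurons 1 and 2 fire for only one (exclusive); the paper's finding is that the shared set grows in size and importance as models develop.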
Set up a virtualenv and install PyTorch manually. Then install the remaining dependencies listed in requirements.txt:

```bash
pip install -r requirements.txt
```

Our experiments have been tested on Python 3.12.9 with transformers 4.51.3.
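For orientation before running the pipeline scripts below: the deactivation stage can be sketched in plain PyTorch by zeroing selected hidden units with a forward hook. This is a conceptual sketch, not the repo's actual implementation; the toy linear layer stands in for one MLP layer of an LLM, and the neuron indices are hypothetical:

```python
import torch
import torch.nn as nn

def deactivate_neurons(module, neuron_indices):
    """Zero the given output units of `module` on every forward pass."""
    def hook(mod, inputs, output):
        output[..., neuron_indices] = 0.0  # ablate the selected neurons
        return output
    return module.register_forward_hook(hook)

torch.manual_seed(0)
layer = nn.Linear(8, 8)   # stand-in for one MLP layer of an LLM
exclusive = [1, 5]        # hypothetical language-exclusive neuron indices
handle = deactivate_neurons(layer, exclusive)

out = layer(torch.randn(2, 8))
# out[:, exclusive] is exactly zero; all other units are untouched
handle.remove()           # restore the layer's original behavior
```

Because hooks leave the weights untouched, removing the hook restores the original model, which makes this style of ablation convenient for comparing a modified variant against the base LLM.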
Download the datasets from the following links:
| Dataset | Description | 🤗 Download |
|---|---|---|
| Detection | Comprises 1000 sentences per language across 6 languages from the OSCAR dataset, used to identify language-related neurons. | Link |
| Training | A multilingual corpus with at least 100k samples per language from CulturaX, MADLAD, and Wikipedia, used for targeted neuron pretraining. | Link |
| Evaluation | Includes MMMLU and MGSM datasets for measuring multilingual performance on high-, medium-, and low-resource languages. | Link |
After placing the data in the ./dataset folder, you can run the following scripts to replicate key stages of our pipeline:
By running the following command, you will detect language-related neurons across multiple languages in a given LLM:
```bash
bash detection.sh
```

By running the following command, you will deactivate language-specific neurons and obtain a modified LLM variant:
```bash
bash deactivation.sh
```

By running the following command, you will pretrain the LLM with language-specific data to enhance its performance in that language:
```bash
bash train.sh
```

If you find our repo useful, please consider citing:
@misc{chen2025abstractthought,
title={The Emergence of Abstract Thought in Large Language Models Beyond Any Language},
author={Yuxin Chen and Yiran Zhao and Yang Zhang and An Zhang and Kenji Kawaguchi and Shafiq Joty and Junnan Li and Tat-Seng Chua and Michael Qizhe Shieh and Wenxuan Zhang},
year={2025},
eprint={2506.xxxxx},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.xxxx},
}

