In this work, we find that LLMs progressively develop a core language-agnostic parameter space. This compact yet critical set of parameters underlies the model’s ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system.
Specifically, we identify language-related neurons, i.e., those consistently activated when processing particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one).
As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence.
Motivated by these insights, we propose neuron-specific training strategies tailored to each model's level of language-agnosticism at different stages of development.
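As an illustration of the shared/exclusive categorization above, here is a toy sketch with synthetic activation statistics. It is not the paper's actual detection code, and the 0.95 activation-probability threshold is an assumption for illustration only:

```python
import numpy as np

def categorize_neurons(act_prob, threshold=0.95):
    """Classify neurons from per-language activation probabilities.

    act_prob: dict mapping language -> array of shape (n_neurons,),
    giving the fraction of sentences in which each neuron fires.
    A neuron counts as "language-related" for a language if its
    activation probability exceeds `threshold` (an assumed cutoff,
    not the paper's exact criterion).
    """
    langs = list(act_prob)
    related = {l: set(np.flatnonzero(act_prob[l] > threshold)) for l in langs}
    all_related = set().union(*related.values())
    # Shared: related to more than one language; exclusive: exactly one.
    shared = {n for n in all_related if sum(n in related[l] for l in langs) > 1}
    exclusive = all_related - shared
    return shared, exclusive

# Toy example: 5 neurons, 2 languages.
probs = {
    "en": np.array([0.99, 0.97, 0.10, 0.96, 0.20]),
    "zh": np.array([0.98, 0.10, 0.99, 0.97, 0.30]),
}
shared, exclusive = categorize_neurons(probs)
# shared -> {0, 3}; exclusive -> {1, 2}
```

In this sketch, neurons 0 and 3 fire reliably for both languages (shared), while neurons 1 and 2 fire for only one (exclusive); the paper's finding is that the shared set grows in size and importance as models develop.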
Set up a virtualenv and install PyTorch manually. Then install the remaining dependencies listed in requirements.txt:

```bash
pip install -r requirements.txt
```

Our experiments have been tested on Python 3.12.9 with transformers 4.51.3.
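For orientation before running the pipeline scripts below: the deactivation stage can be sketched in plain PyTorch by zeroing selected hidden units with a forward hook. This is a conceptual sketch, not the repo's actual implementation; the toy linear layer stands in for one MLP layer of an LLM, and the neuron indices are hypothetical:

```python
import torch
import torch.nn as nn

def deactivate_neurons(module, neuron_indices):
    """Zero the given output units of `module` on every forward pass."""
    def hook(mod, inputs, output):
        output[..., neuron_indices] = 0.0  # ablate the selected neurons
        return output
    return module.register_forward_hook(hook)

torch.manual_seed(0)
layer = nn.Linear(8, 8)   # stand-in for one MLP layer of an LLM
exclusive = [1, 5]        # hypothetical language-exclusive neuron indices
handle = deactivate_neurons(layer, exclusive)

out = layer(torch.randn(2, 8))
# out[:, exclusive] is exactly zero; all other units are untouched
handle.remove()           # restore the layer's original behavior
```

Because hooks leave the weights untouched, removing the hook restores the original model, which makes this style of ablation convenient for comparing a modified variant against the base LLM.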
Download the datasets from the following links:
| Dataset | Description | 🤗 Download |
|---|---|---|
| Detection | Comprises 1000 sentences per language across 6 languages from the OSCAR dataset, used to identify language-related neurons. | Link |
| Training | A multilingual corpus with at least 100k samples per language from CulturaX, MADLAD, and Wikipedia, used for targeted neuron pretraining. | Link |
| Evaluation | Includes MMMLU and MGSM datasets for measuring multilingual performance on high-, medium-, and low-resource languages. | Link |
After placing the data in the ./dataset folder, you can run the following scripts to replicate key stages of our pipeline:
By running the following command, you will detect language-related neurons across multiple languages in a given LLM:
```bash
bash detection.sh
```

By running the following command, you will deactivate language-specific neurons and obtain a modified LLM variant:
```bash
bash deactivation.sh
```

By running the following command, you will pretrain the LLM with language-specific data to enhance its performance in that language:
```bash
bash train.sh
```

If you find our repo useful, please consider citing:
@misc{chen2025abstractthought,
title={The Emergence of Abstract Thought in Large Language Models Beyond Any Language},
author={Yuxin Chen and Yiran Zhao and Yang Zhang and An Zhang and Kenji Kawaguchi and Shafiq Joty and Junnan Li and Tat-Seng Chua and Michael Qizhe Shieh and Wenxuan Zhang},
year={2025},
eprint={2506.xxxxx},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.xxxx},
}

