
The Emergence of Abstract Thought in Large Language Models Beyond Any Language


In this work, we find that LLMs progressively develop a core language-agnostic parameter space. This compact yet critical set of parameters underlies the model’s ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system.

Specifically, we identify language-related neurons, i.e., those consistently activated when processing a particular language, and categorize them as either shared (active across multiple languages) or exclusive (specific to one).

As LLMs continue to develop over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively lose influence.
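The shared/exclusive split described above can be sketched in a few lines. This is an illustrative toy, not the paper's detection procedure (which lives in `detection.sh`); the input format, a mapping from language to the set of neuron ids consistently activated for it, is our own assumption:

```python
def classify_neurons(activated_by_language):
    """Split neurons into shared vs. exclusive sets.

    activated_by_language: dict mapping language -> set of neuron ids
    consistently activated for that language (assumed input format).
    Returns (shared, exclusive): shared neurons fire for 2+ languages,
    exclusive neurons fire for exactly one.
    """
    shared, exclusive = set(), set()
    all_neurons = set().union(*activated_by_language.values())
    for n in all_neurons:
        # Count how many languages activate this neuron.
        langs = [l for l, s in activated_by_language.items() if n in s]
        if len(langs) > 1:
            shared.add(n)      # active across multiple languages
        else:
            exclusive.add(n)   # specific to one language
    return shared, exclusive
```

For example, `classify_neurons({"en": {1, 2}, "fr": {2, 3}})` marks neuron 2 as shared and neurons 1 and 3 as exclusive.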

*Figures: shared/exclusive neuron percentages and deactivation results.*

Motivated by these insights, we propose neuron-specific training strategies tailored to the language-agnostic level of an LLM at each development stage.
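One way to read "neuron-specific training" is to update only a chosen subset of parameters while freezing the rest. The following pure-Python masked SGD step is our own minimal illustration of that idea, not the repository's actual implementation (which applies such masking inside PyTorch training, driven by `train.sh`):

```python
def masked_update(params, grads, trainable, lr=0.1):
    """One SGD step that updates only the parameter indices listed in
    `trainable`, leaving all other parameters frozen.

    params, grads: flat lists of scalar parameters and their gradients.
    trainable: set of indices allowed to change (e.g., shared neurons).
    """
    return [p - lr * g if i in trainable else p
            for i, (p, g) in enumerate(zip(params, grads))]
```

For instance, `masked_update([1.0, 2.0], [1.0, 1.0], {0}, lr=0.5)` moves only the first parameter and returns `[0.5, 2.0]`.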


⚙️ Installation

Set up a virtual environment and install PyTorch manually. Then install the remaining dependencies listed in `requirements.txt`:

```shell
pip install -r requirements.txt
```

Our code has been tested with Python 3.12.9 and transformers 4.51.3.

📚 Datasets

Download the datasets from the following links:

| Dataset | Description | 🤗 Download |
| --- | --- | --- |
| Detection | 1,000 sentences per language across 6 languages from the OSCAR dataset, used to identify language-related neurons. | Link |
| Training | A multilingual corpus with at least 100k samples per language from CulturaX, MADLAD, and Wikipedia, used for targeted neuron pretraining. | Link |
| Evaluation | The MMMLU and MGSM datasets, used to measure multilingual performance on high-, medium-, and low-resource languages. | Link |

⌛️ Quick Start

After placing the data in the `./dataset` folder, run the following scripts to replicate the key stages of our pipeline:

Detect language-related neurons across multiple languages in a given LLM:

```shell
bash detection.sh
```

Deactivate language-specific neurons to obtain a modified LLM variant:

```shell
bash deactivation.sh
```

Pretrain the LLM on language-specific data to enhance its performance in that language:

```shell
bash train.sh
```
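Conceptually, the deactivation step zeroes the outputs of selected neurons (e.g., the neurons exclusive to one language) so the rest of the model runs unchanged. The toy function below illustrates that idea on a flat activation vector; it is not the repository's actual mechanism inside `deactivation.sh`:

```python
def deactivate(activations, neuron_ids):
    """Zero the outputs of the neurons in `neuron_ids`, leaving all
    other activations untouched (illustrative sketch only)."""
    return [0.0 if i in neuron_ids else a
            for i, a in enumerate(activations)]
```

For example, `deactivate([0.3, -1.2, 0.7], {1})` returns `[0.3, 0.0, 0.7]`.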

📊 Enhancement Results

*Figure: multilingual enhancement results.*

📖 Citation

If you find our repo useful, please consider citing:

@misc{chen2025abstractthought,
      title={The Emergence of Abstract Thought in Large Language Models Beyond Any Language}, 
      author={Yuxin Chen and Yiran Zhao and Yang Zhang and An Zhang and Kawaguchi Kenji and Shafiq Joty and Junnan Li and Tat-Seng Chua and Michael Qizhe Shish and Wenxuan Zhang},
      year={2025},
      eprint={2506.xxxxx},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.xxxx}, 
}

About

[NeurIPS 2025] The implementation of paper "The Emergence of Abstract Thought in Large Language Models Beyond Any Language"
