This is the official implementation for the paper "PENCIL: Long Thoughts with Short Memory".

PENCIL is a generation paradigm that enables language models to generate very long chains of thought (CoT) within a small context window, for solving larger-scale and more complicated reasoning problems. In short, PENCIL incorporates a cleaning mechanism into CoT that periodically and recursively erases intermediate thoughts that are no longer needed for generating future thoughts, so as to manage space efficiently. Theoretically, PENCIL with a transformer base model can provably solve any computational task with optimal time and space efficiency.
Materials: [Paper](https://arxiv.org/abs/2503.14337)
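To make the cleaning mechanism concrete, here is a minimal Python sketch of the reduction rule from the paper, C [CALL] T [SEP] A [RETURN] → C A: once a subroutine returns, its intermediate thoughts T are erased and only the answer A is kept in the context. The function name and the list-of-strings representation are illustrative, not the repository's actual API.

```python
# Illustrative sketch of PENCIL's reduction rule (not the repository's API).
# A well-formed sequence "C [CALL] T [SEP] A [RETURN]" reduces to "C A":
# the thoughts T and the three special tokens are erased; the answer A stays.

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens: list[str]) -> list[str]:
    """Apply the reduction rule to the innermost completed subroutine."""
    if RETURN not in tokens:
        return tokens                                   # nothing to erase yet
    r = tokens.index(RETURN)                            # innermost [RETURN]
    s = max(i for i in range(r) if tokens[i] == SEP)    # its matching [SEP]
    c = max(i for i in range(s) if tokens[i] == CALL)   # its matching [CALL]
    return tokens[:c] + tokens[s + 1:r] + tokens[r + 1:]

seq = "solve [CALL] try x=0 ... contradiction [SEP] x=1 [RETURN]".split()
print(reduce_once(seq))  # ['solve', 'x=1']
```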
- Install PyTorch according to the instructions on the official website.
- Install other required packages via

  ```
  pip install -r requirements.txt
  ```

See scripts in the `scripts` folder.
Create a folder `data/` and prepare the datasets by running the following commands.
For SAT and QBF:

```
python dataset_sat.py \
    --num_samples=102000 \
    --train_size=5000 \
    --data_dir=data/${dataset} \
    --min_vars=5 \
    --max_vars=5
```

This script creates a total of `--num_samples` instances, which by default include 1,000 validation samples and 1,000 test samples (adjustable in the `process_dataset` function). The remaining examples form the training set. Note that the training set is split into multiple files, each containing `--train_size` training instances, for large-scale experiments where a single file could otherwise be very large.

The `--min_vars` and `--max_vars` arguments set the minimum and maximum number of variables per instance. If the two values are equal, as in our experiments, all instances have the same number of variables; if they differ, the dataset includes a mix of instances with varying numbers of variables.
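As a rough illustration of the partitioning described above (the actual logic lives in the `process_dataset` function of `dataset_sat.py`; the variable names here are made up):

```python
# Hypothetical sketch of the split described above; see process_dataset in
# dataset_sat.py for the real implementation.
num_samples, train_size = 102000, 5000
val_size = test_size = 1000            # defaults, adjustable in process_dataset

samples = list(range(num_samples))     # stand-in for generated instances
val = samples[:val_size]
test = samples[val_size:val_size + test_size]
train = samples[val_size + test_size:]

# The training set is written out in chunks of --train_size instances each.
chunks = [train[i:i + train_size] for i in range(0, len(train), train_size)]
print(len(train), len(chunks))         # 100000 20
```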
For Einstein's puzzle:

```
python einstein_generator.py \
    --num_samples 100000 \
    --data_dir data/${dataset} \
    --size 5 \
    --minimal_conditions \
    --save

python einstein_solver.py \
    --data_dir data/${dataset} \
    --train_size 5000
```

First run `einstein_generator.py` to create a dataset of puzzle instances (with clues and solutions but no reasoning steps). Then run `einstein_solver.py` to process the generated puzzles: it solves them and generates the reasoning steps (with special tokens for training the model).
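To drive both steps from Python rather than the shell, a sketch along the following lines should work (the dataset name is a made-up placeholder; the flags mirror the commands above):

```python
# Hypothetical Python driver for the two-step pipeline above.
import subprocess

dataset = "einstein_5"  # placeholder; substitute your own ${dataset} value

subprocess.run(
    ["python", "einstein_generator.py",
     "--num_samples", "100000",
     "--data_dir", f"data/{dataset}",
     "--size", "5",
     "--minimal_conditions",
     "--save"],
    check=True,
)
subprocess.run(
    ["python", "einstein_solver.py",
     "--data_dir", f"data/{dataset}",
     "--train_size", "5000"],
    check=True,
)
```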
For SAT and QBF:

```
python train.py \
    config/config_3sat.py \
    --dataset=$dataset \
    --data_dir=data \
    --device=cuda \
    --format=pencil
```

For Einstein's puzzle:

```
python train_puzzle.py \
    config/config_puzzle.py \
    --dataset=$dataset \
    --data_dir=data \
    --device=cuda \
    --format=pencil
```
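The `python train.py config/... --key=value` calling convention follows nanoGPT (acknowledged below): the positional argument is a Python config file that gets executed, and subsequent `--key=value` flags override its globals. Below is a minimal sketch of that configurator idiom, assuming this repo follows nanoGPT's pattern; it is not necessarily this repository's exact code.

```python
# Sketch of the nanoGPT-style configurator idiom assumed by the commands above;
# not necessarily this repository's exact implementation.
import sys
from ast import literal_eval

dataset, device, format = "3sat", "cuda", "pencil"  # overridable defaults

for arg in sys.argv[1:]:
    if not arg.startswith("--"):
        exec(open(arg).read())           # e.g. config/config_3sat.py
    else:
        key, val = arg[2:].split("=", 1)
        try:
            val = literal_eval(val)      # parse numbers, booleans, ...
        except (ValueError, SyntaxError):
            pass                         # otherwise keep the raw string
        globals()[key] = val             # e.g. --dataset=..., --format=pencil
```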
If you find our code useful, please consider citing our work:

```
@misc{yang2025pencil,
      title={PENCIL: Long Thoughts with Short Memory},
      author={Chenxiao Yang and Nathan Srebro and David McAllester and Zhiyuan Li},
      year={2025},
      eprint={2503.14337},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.14337},
}
```

We thank [nanoGPT](https://github.com/karpathy/nanoGPT) and Puzzle-Generator-and-Solver for providing useful implementations.
