Test-time Recursive Thinking: Self-Improvement without External Feedback (arXiv)
Test-time Recursive Thinking (TRT) is an agentic framework that enables LLMs to self-improve during inference through iterative reflection—without requiring external feedback or reward signals.
TRT operates in three stages:
- Generate: The model produces multiple solution candidates for a given problem
- Select: Solutions are evaluated and the best candidates are identified using self-consistency or verification
- Reflect: The model analyzes successful and failed attempts to extract generalizable insights, which inform subsequent generation rounds
This recursive process allows the model to accumulate knowledge within a session, progressively improving solution quality through self-directed learning.
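The three-stage loop above can be sketched in a few lines. This is an illustrative toy, not the repo's implementation: `generate_candidate` stands in for an actual LLM call, and selection here uses answer-level majority voting as one simple form of self-consistency.

```python
from collections import Counter

def trt_solve(problem, generate_candidate, rounds=3, n_candidates=4):
    """Toy sketch of the Generate -> Select -> Reflect loop.

    `generate_candidate(problem, insights)` stands in for an LLM call
    and returns a (solution, answer) pair. No external feedback is used:
    selection is by majority vote over final answers (self-consistency),
    and reflection just records what the candidates agreed/disagreed on.
    """
    insights = []  # knowledge accumulated across rounds within the session
    best = None
    for _ in range(rounds):
        # Generate: sample several candidate solutions
        candidates = [generate_candidate(problem, insights)
                      for _ in range(n_candidates)]
        # Select: majority vote over the candidates' final answers
        votes = Counter(ans for _, ans in candidates)
        top_answer, _ = votes.most_common(1)[0]
        best = next(sol for sol, ans in candidates if ans == top_answer)
        # Reflect: store the consensus and the outlier answers so the
        # next round's generation can condition on them
        insights.append({
            "consensus": top_answer,
            "disagreements": [ans for _, ans in candidates
                              if ans != top_answer],
        })
    return best, insights
```

In the real framework the reflection step extracts natural-language insights from the attempts rather than raw vote statistics, but the control flow is the same: each round's generation is conditioned on everything learned so far.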
- Python >= 3.10
- CUDA-compatible GPU (recommended for vLLM-based experiments)
- Azure OpenAI API access (for LiveCodeBench experiments)
```bash
git clone https://github.com/YufanZhuang/test-time-recursive-thinking.git
cd test-time-recursive-thinking
./setup_env.sh

# Install AIME dependencies
cd AIME
pip install -r requirements.txt

# Install LiveCodeBench
cd ../LiveCodeBench
pip install -e .

# Install MCP server dependencies (for TRT agentic mode)
pip install mcp fastmcp aiofiles orjson
```

```bash
cd AIME/bash_scripts
bash qwen3.sh      # Run Qwen3-235B evaluation
bash gpt_oss.sh    # Run GPT model evaluation
```

Step 1: Set environment variables

```bash
export OPENAI_API_KEY="your-azure-openai-api-key"
export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
```

Step 2: Start the TRT MCP server

```bash
cd LiveCodeBench/kflow_mcp
bash start_server.sh
```

Step 3: Run TRT evaluation (in a separate terminal)

```bash
cd LiveCodeBench/bash_scripts
bash kflow_o4-mini.sh   # o4-mini with TRT
bash kflow_o3.sh        # o3 with TRT
```

| Parameter | Description | Default |
|---|---|---|
| `--model_name` | Model identifier (e.g., `Qwen/Qwen3-235B-A22B-Thinking-2507`) | Required |
| `--max_new_tokens` | Maximum tokens to generate | 262144 |
| `--temperature` | Sampling temperature | 0.6 |
| `--reflex_size` | Number of reflection samples (Maj@N) | 64 |
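`--reflex_size` sets the N in Maj@N: the most frequent final answer among N samples is selected. A minimal illustration of that vote (the function name here is illustrative, not the repo's actual code):

```python
from collections import Counter

def maj_at_n(sampled_answers):
    """Return the most common final answer among N samples (Maj@N).

    For equal counts, Counter.most_common orders elements by first
    occurrence, so ties go to the earliest-sampled answer.
    """
    if not sampled_answers:
        raise ValueError("need at least one sample")
    return Counter(sampled_answers).most_common(1)[0][0]
```

For example, `maj_at_n(["204", "210", "204"])` selects `"204"`; with the default `--reflex_size` the vote runs over 64 samples.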
| Parameter | Description | Default |
|---|---|---|
| `--model` | Model name/identifier | Required |
| `--scenario` | Evaluation scenario | codegeneration |
| `--max_tokens` | Maximum token limit | 200000 |
| `--trt_rounds` | Number of TRT iterations | 8 |
| `--roll_out_n` | Number of rollouts per problem | 2 |
| `--difficulty` | Problem difficulty filter | hard |
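The defaults above imply several full generations per problem. As a rough budget estimate, assuming each TRT round regenerates every rollout (an assumption for illustration; the actual runner may reuse or early-stop):

```python
def generation_budget(trt_rounds=8, roll_out_n=2, problems=1):
    """Rough upper bound on generation calls: one per rollout per round.

    Assumes every round regenerates all rollouts, so treat this as a
    ceiling rather than the runner's exact accounting.
    """
    return trt_rounds * roll_out_n * problems
```

With the defaults (`--trt_rounds 8`, `--roll_out_n 2`), a single problem costs up to 16 generations, so plan API quotas accordingly.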
TRT achieves 100% accuracy on both the AIME-24 and AIME-25 benchmarks.
TRT provides significant improvements on hard coding problems:
| Model | Baseline | TRT | Improvement |
|---|---|---|---|
| o4-mini (high) | 63.5% | 73.9% | +10.4pp |
| o3 (high) | 57.1% | 71.9% | +14.8pp |
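The Improvement column is the percentage-point gap between the TRT and baseline pass rates:

```python
def improvement_pp(baseline, trt):
    """Percentage-point improvement: TRT accuracy minus baseline accuracy."""
    return round(trt - baseline, 1)
```

For example, `improvement_pp(63.5, 73.9)` gives the `+10.4pp` reported for o4-mini (high).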
```
test-time-recursive-thinking/
├── README.md               # This file
├── LICENSE                 # MIT License
├── setup_env.sh            # Environment setup script
├── assets/                 # Images and figures
├── AIME/                   # AIME mathematical reasoning
│   ├── bash_scripts/       # Experiment launch scripts
│   ├── scripts/            # Python evaluation scripts
│   ├── requirements.txt    # Dependencies
│   └── README.md
└── LiveCodeBench/          # Code generation evaluation
    ├── bash_scripts/       # Experiment launch scripts
    ├── kflow_mcp/          # TRT MCP server
    ├── lcb_runner/         # Main evaluation runner
    ├── pyproject.toml      # Package configuration
    └── README.md
```
If you have any questions related to the code or the paper, feel free to reach out to us at y5zhuang@ucsd.edu.
If you find our paper and code useful, please cite us:
```bibtex
@misc{zhuang2026testtimerecursivethinkingselfimprovement,
  title={Test-time Recursive Thinking: Self-Improvement without External Feedback},
  author={Yufan Zhuang and Chandan Singh and Liyuan Liu and Yelong Shen and Dinghuai Zhang and Jingbo Shang and Jianfeng Gao and Weizhu Chen},
  year={2026},
  eprint={2602.03094},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.03094},
}
```