EvanZhuang/test_time_recursive_thinking

Test-time Recursive Thinking (TRT)

Self-Improvement without External Feedback (arXiv)


What is TRT?

Test-time Recursive Thinking (TRT) is an agentic framework that enables LLMs to self-improve during inference through iterative reflection, without requiring external feedback or reward signals.

TRT operates in three stages:

  1. Generate: The model produces multiple solution candidates for a given problem
  2. Select: Solutions are evaluated and the best candidates are identified using self-consistency or verification
  3. Reflect: The model analyzes successful and failed attempts to extract generalizable insights, which inform subsequent generation rounds

This recursive process allows the model to accumulate knowledge within a session, progressively improving solution quality through self-directed learning.
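The three stages above can be sketched as a small loop. This is an illustrative outline only, not the repository's implementation: `trt_loop`, `toy_generate`, and `toy_reflect` are hypothetical names, the toy functions stand in for real LLM calls, and selection is shown with simple majority voting as one form of self-consistency.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical signature: (problem, insights, n) -> n candidate answers.
Generator = Callable[[str, List[str], int], List[str]]
# Hypothetical signature: (problem, candidates, selected) -> one new insight.
Reflector = Callable[[str, List[str], str], str]


def select_by_self_consistency(candidates: List[str]) -> str:
    """Select stage: return the most frequent candidate answer."""
    return Counter(candidates).most_common(1)[0][0]


def trt_loop(
    problem: str,
    generate: Generator,
    reflect: Reflector,
    rounds: int = 3,
    n_candidates: int = 4,
) -> Tuple[str, List[str]]:
    """Run generate -> select -> reflect for a fixed number of rounds."""
    insights: List[str] = []  # knowledge accumulated within the session
    best = ""
    for _ in range(rounds):
        candidates = generate(problem, insights, n_candidates)  # Generate
        best = select_by_self_consistency(candidates)           # Select
        insights.append(reflect(problem, candidates, best))     # Reflect
    return best, insights


def toy_generate(problem: str, insights: List[str], n: int) -> List[str]:
    """Stand-in for an LLM call: answers converge once an insight exists."""
    if insights:
        return ["42"] * n
    return ["41", "42", "41", "40"][:n]


def toy_reflect(problem: str, candidates: List[str], selected: str) -> str:
    """Stand-in for an LLM reflection: summarize how the round went."""
    return f"{len(set(candidates))} distinct answers; selected {selected}"


best, insights = trt_loop("toy problem", toy_generate, toy_reflect, rounds=2)
print(best, insights)
```

The point of the sketch is the data flow: no external grader appears anywhere in the loop, and the only state carried between rounds is the list of insights.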


Installation

Prerequisites

  • Python >= 3.10
  • CUDA-compatible GPU (recommended for vLLM-based experiments)
  • Azure OpenAI API access (for LiveCodeBench experiments)

Quick Install

git clone https://github.com/YufanZhuang/test-time-recursive-thinking.git
cd test-time-recursive-thinking
./setup_env.sh

Manual Install

# Install AIME dependencies
cd AIME
pip install -r requirements.txt

# Install LiveCodeBench
cd ../LiveCodeBench
pip install -e .

# Install MCP server dependencies (for TRT agentic mode)
pip install mcp fastmcp aiofiles orjson

Quick Start

AIME Mathematical Reasoning

cd AIME/bash_scripts
bash qwen3.sh     # Run Qwen3-235B evaluation
bash gpt_oss.sh   # Run GPT model evaluation

LiveCodeBench Code Generation

Step 1: Set environment variables

export OPENAI_API_KEY="your-azure-openai-api-key"
export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
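Before starting the server, it can help to confirm that both variables are actually visible to Python. The helper below is a hypothetical convenience check, not part of the repository; only the two variable names come from the step above.

```python
import os
from typing import List, Mapping

# The two variables named in Step 1.
REQUIRED_VARS = ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report the result without raising, so the check is safe to run anywhere.
missing = missing_vars()
print("Missing:", ", ".join(missing) if missing else "none")
```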

Step 2: Start the TRT MCP server

cd LiveCodeBench/kflow_mcp
bash start_server.sh

Step 3: Run TRT evaluation (in a separate terminal)

cd LiveCodeBench/bash_scripts
bash kflow_o4-mini.sh   # o4-mini with TRT
bash kflow_o3.sh        # o3 with TRT

Configuration Options

AIME Parameters

| Parameter | Description | Default |
|---|---|---|
| --model_name | Model identifier (e.g., Qwen/Qwen3-235B-A22B-Thinking-2507) | Required |
| --max_new_tokens | Maximum tokens to generate | 262144 |
| --temperature | Sampling temperature | 0.6 |
| --reflex_size | Number of reflection samples (Maj@N) | 64 |
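The Maj@N note on `--reflex_size` can be illustrated with a short scoring sketch. It assumes Maj@N has its usual meaning (majority vote over N sampled answers per problem); the function names and the toy data are hypothetical and are not taken from the AIME scripts.

```python
from collections import Counter
from typing import Dict, List


def majority_answer(samples: List[str]) -> str:
    """Return the most common answer after stripping whitespace."""
    return Counter(s.strip() for s in samples).most_common(1)[0][0]


def maj_at_n_accuracy(
    samples_by_problem: Dict[str, List[str]],
    references: Dict[str, str],
) -> float:
    """Fraction of problems whose majority answer matches the reference."""
    correct = sum(
        majority_answer(samples) == references[pid]
        for pid, samples in samples_by_problem.items()
    )
    return correct / len(samples_by_problem)


# Toy data: N = 3 samples for each of two made-up problems.
samples = {"p1": ["204", "204", "113"], "p2": ["7", " 8", "8"]}
references = {"p1": "204", "p2": "8"}
print(maj_at_n_accuracy(samples, references))
```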

LiveCodeBench Parameters

| Parameter | Description | Default |
|---|---|---|
| --model | Model name/identifier | Required |
| --scenario | Evaluation scenario | codegeneration |
| --max_tokens | Maximum token limit | 200000 |
| --trt_rounds | Number of TRT iterations | 8 |
| --roll_out_n | Number of rollouts per problem | 2 |
| --difficulty | Problem difficulty filter | hard |
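For reference, the flags and defaults in the table can be mirrored with a small `argparse` parser. This is a hypothetical stand-in for experimenting with the options; the parser in `lcb_runner` may define these flags differently and likely accepts others as well.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the table's flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Sketch of the LiveCodeBench TRT options"
    )
    parser.add_argument("--model", required=True, help="Model name/identifier")
    parser.add_argument("--scenario", default="codegeneration",
                        help="Evaluation scenario")
    parser.add_argument("--max_tokens", type=int, default=200000,
                        help="Maximum token limit")
    parser.add_argument("--trt_rounds", type=int, default=8,
                        help="Number of TRT iterations")
    parser.add_argument("--roll_out_n", type=int, default=2,
                        help="Number of rollouts per problem")
    parser.add_argument("--difficulty", default="hard",
                        help="Problem difficulty filter")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--model", "o4-mini", "--trt_rounds", "4"])
print(args)
```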

Evaluations

AIME Mathematical Reasoning

TRT achieves 100% accuracy on both AIME-24 and AIME-25 benchmarks.

LiveCodeBench Code Generation (Hard Problems)

TRT improves accuracy on hard coding problems by more than 10 percentage points for both models tested:

| Model | Baseline | TRT | Improvement |
|---|---|---|---|
| o4-mini (high) | 63.5% | 73.9% | +10.4pp |
| o3 (high) | 57.1% | 71.9% | +14.8pp |
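The "pp" column is an absolute difference in percentage points, which is not the same as a relative gain. The short check below recomputes both from the table's numbers; the relative figures are derived here for illustration and are not reported in the original results.

```python
# (baseline %, TRT %) pairs copied from the table above.
RESULTS = {
    "o4-mini (high)": (63.5, 73.9),
    "o3 (high)": (57.1, 71.9),
}


def gain_pp(baseline: float, trt: float) -> float:
    """Absolute improvement in percentage points."""
    return round(trt - baseline, 1)


def relative_gain_pct(baseline: float, trt: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100 * (trt - baseline) / baseline, 1)


for model, (baseline, trt) in RESULTS.items():
    print(f"{model}: +{gain_pp(baseline, trt)}pp, "
          f"{relative_gain_pct(baseline, trt)}% relative")
```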

Project Structure

test-time-recursive-thinking/
├── README.md                 # This file
├── LICENSE                   # MIT License
├── setup_env.sh              # Environment setup script
├── assets/                   # Images and figures
├── AIME/                     # AIME mathematical reasoning
│   ├── bash_scripts/         # Experiment launch scripts
│   ├── scripts/              # Python evaluation scripts
│   ├── requirements.txt      # Dependencies
│   └── README.md
└── LiveCodeBench/            # Code generation evaluation
    ├── bash_scripts/         # Experiment launch scripts
    ├── kflow_mcp/            # TRT MCP server
    ├── lcb_runner/           # Main evaluation runner
    ├── pyproject.toml        # Package configuration
    └── README.md

Questions?

If you have any questions related to the code or the paper, feel free to reach out to us at y5zhuang@ucsd.edu.


Citation

If you find our paper and code useful, please cite us:

@misc{zhuang2026testtimerecursivethinkingselfimprovement,
      title={Test-time Recursive Thinking: Self-Improvement without External Feedback}, 
      author={Yufan Zhuang and Chandan Singh and Liyuan Liu and Yelong Shen and Dinghuai Zhang and Jingbo Shang and Jianfeng Gao and Weizhu Chen},
      year={2026},
      eprint={2602.03094},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.03094}, 
}
