Test-time Recursive Thinking: Self-Improvement without External Feedback (arXiv)
Test-time Recursive Thinking (TRT) is an agentic framework that enables LLMs to self-improve during inference through iterative reflection—without requiring external feedback or reward signals.
TRT operates in three stages:
- Generate: The model produces multiple solution candidates for a given problem
- Select: Solutions are evaluated and the best candidates are identified using self-consistency or verification
- Reflect: The model analyzes successful and failed attempts to extract generalizable insights, which inform subsequent generation rounds
This recursive process allows the model to accumulate knowledge within a session, progressively improving solution quality through self-directed learning.
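The three-stage loop above can be sketched in a few lines. This is an illustrative toy, not the repo's implementation: `generate_candidate` stands in for an actual LLM call, and selection here uses answer-level majority voting as one simple form of self-consistency.

```python
from collections import Counter

def trt_solve(problem, generate_candidate, rounds=3, n_candidates=4):
    """Toy sketch of the Generate -> Select -> Reflect loop.

    `generate_candidate(problem, insights)` stands in for an LLM call
    and returns a (solution, answer) pair. No external feedback is used:
    selection is by majority vote over final answers (self-consistency),
    and reflection just records what the candidates agreed/disagreed on.
    """
    insights = []  # knowledge accumulated across rounds within the session
    best = None
    for _ in range(rounds):
        # Generate: sample several candidate solutions
        candidates = [generate_candidate(problem, insights)
                      for _ in range(n_candidates)]
        # Select: majority vote over the candidates' final answers
        votes = Counter(ans for _, ans in candidates)
        top_answer, _ = votes.most_common(1)[0]
        best = next(sol for sol, ans in candidates if ans == top_answer)
        # Reflect: store the consensus and the outlier answers so the
        # next round's generation can condition on them
        insights.append({
            "consensus": top_answer,
            "disagreements": [ans for _, ans in candidates
                              if ans != top_answer],
        })
    return best, insights
```

In the real framework the reflection step extracts natural-language insights from the attempts rather than raw vote statistics, but the control flow is the same: each round's generation is conditioned on everything learned so far.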
- Python >= 3.10
- CUDA-compatible GPU (recommended for vLLM-based experiments)
- Azure OpenAI API access (for LiveCodeBench experiments)
```bash
git clone https://github.com/YufanZhuang/test-time-recursive-thinking.git
cd test-time-recursive-thinking
./setup_env.sh

# Install AIME dependencies
cd AIME
pip install -r requirements.txt

# Install LiveCodeBench
cd ../LiveCodeBench
pip install -e .

# Install MCP server dependencies (for TRT agentic mode)
pip install mcp fastmcp aiofiles orjson
```

```bash
cd AIME/bash_scripts
bash qwen3.sh      # Run Qwen3-235B evaluation
bash gpt_oss.sh    # Run GPT model evaluation
```

Step 1: Set environment variables

```bash
export OPENAI_API_KEY="your-azure-openai-api-key"
export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
```

Step 2: Start the TRT MCP server

```bash
cd LiveCodeBench/kflow_mcp
bash start_server.sh
```

Step 3: Run TRT evaluation (in a separate terminal)

```bash
cd LiveCodeBench/bash_scripts
bash kflow_o4-mini.sh   # o4-mini with TRT
bash kflow_o3.sh        # o3 with TRT
```

| Parameter | Description | Default |
|---|---|---|
| `--model_name` | Model identifier (e.g., `Qwen/Qwen3-235B-A22B-Thinking-2507`) | Required |
| `--max_new_tokens` | Maximum tokens to generate | 262144 |
| `--temperature` | Sampling temperature | 0.6 |
| `--reflex_size` | Number of reflection samples (Maj@N) | 64 |
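`--reflex_size` sets the N in Maj@N: the most frequent final answer among N samples is selected. A minimal illustration of that vote (the function name here is illustrative, not the repo's actual code):

```python
from collections import Counter

def maj_at_n(sampled_answers):
    """Return the most common final answer among N samples (Maj@N).

    For equal counts, Counter.most_common orders elements by first
    occurrence, so ties go to the earliest-sampled answer.
    """
    if not sampled_answers:
        raise ValueError("need at least one sample")
    return Counter(sampled_answers).most_common(1)[0][0]
```

For example, `maj_at_n(["204", "210", "204"])` selects `"204"`; with the default `--reflex_size` the vote runs over 64 samples.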
| Parameter | Description | Default |
|---|---|---|
| `--model` | Model name/identifier | Required |
| `--scenario` | Evaluation scenario | codegeneration |
| `--max_tokens` | Maximum token limit | 200000 |
| `--trt_rounds` | Number of TRT iterations | 8 |
| `--roll_out_n` | Number of rollouts per problem | 2 |
| `--difficulty` | Problem difficulty filter | hard |
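The defaults above imply several full generations per problem. As a rough budget estimate, assuming each TRT round regenerates every rollout (an assumption for illustration; the actual runner may reuse or early-stop):

```python
def generation_budget(trt_rounds=8, roll_out_n=2, problems=1):
    """Rough upper bound on generation calls: one per rollout per round.

    Assumes every round regenerates all rollouts, so treat this as a
    ceiling rather than the runner's exact accounting.
    """
    return trt_rounds * roll_out_n * problems
```

With the defaults (`--trt_rounds 8`, `--roll_out_n 2`), a single problem costs up to 16 generations, so plan API quotas accordingly.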
TRT achieves 100% accuracy on both the AIME-24 and AIME-25 benchmarks.
TRT provides significant improvements on hard coding problems:
| Model | Baseline | TRT | Improvement |
|---|---|---|---|
| o4-mini (high) | 63.5% | 73.9% | +10.4pp |
| o3 (high) | 57.1% | 71.9% | +14.8pp |
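The Improvement column is the percentage-point gap between the TRT and baseline pass rates:

```python
def improvement_pp(baseline, trt):
    """Percentage-point improvement: TRT accuracy minus baseline accuracy."""
    return round(trt - baseline, 1)
```

For example, `improvement_pp(63.5, 73.9)` gives the `+10.4pp` reported for o4-mini (high).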
```
test-time-recursive-thinking/
├── README.md               # This file
├── LICENSE                 # MIT License
├── setup_env.sh            # Environment setup script
├── assets/                 # Images and figures
├── AIME/                   # AIME mathematical reasoning
│   ├── bash_scripts/       # Experiment launch scripts
│   ├── scripts/            # Python evaluation scripts
│   ├── requirements.txt    # Dependencies
│   └── README.md
└── LiveCodeBench/          # Code generation evaluation
    ├── bash_scripts/       # Experiment launch scripts
    ├── kflow_mcp/          # TRT MCP server
    ├── lcb_runner/         # Main evaluation runner
    ├── pyproject.toml      # Package configuration
    └── README.md
```
If you have any questions related to the code or the paper, feel free to reach out to us at y5zhuang@ucsd.edu.
If you find our paper and code useful, please cite us:
```bibtex
@misc{zhuang2026testtimerecursivethinkingselfimprovement,
  title={Test-time Recursive Thinking: Self-Improvement without External Feedback},
  author={Yufan Zhuang and Chandan Singh and Liyuan Liu and Yelong Shen and Dinghuai Zhang and Jingbo Shang and Jianfeng Gao and Weizhu Chen},
  year={2026},
  eprint={2602.03094},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.03094},
}
```