✨ Archer2.0

🏹️ Reinforcement Learning for Enhanced Reasoning in LLMs 🎯

Overview

Archer2.0 marks a significant evolution from its predecessor through the introduction of Asymmetric Importance Sampling Policy Optimization (ASPO). ASPO is designed to overcome fundamental limitations of PPO-Clip: it mitigates entropy collapse and repetitive outputs and prevents premature convergence, enabling more advanced reinforcement learning.

Our math models are still in training and have not yet converged. In the meantime, we have evaluated Archer2.0 on the LiveCodeBench (LCB) v5 and v6 code benchmarks; the results are shown in the table below.

LCB v5 covers 2024.08.01–2025.02.01; LCB v6 covers 2025.02.01–2025.05.01.

Method	LCB v5 avg@8	LCB v5 pass@8	LCB v6 avg@16	LCB v6 pass@16	Avg.
DeepSeek-R1-1.5B	16.7	29.0	17.2	34.4	17.0
DAPO	26.0	40.5	27.6	43.5	26.8
DeepCoder-1.5B	23.3	39.1	22.6	42.0	23.0
Nemotron-1.5B	26.1	35.5	29.5	42.8	27.8
Archer-Code-1.5B	29.4	43.7	30.2	45.8	29.8
Archer2.0-Code-1.5B-Preview	31.5	47.0	30.5	46.0	31.0
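
The Avg. column appears to be the mean of the two avg@k scores (LCB v5 avg@8 and LCB v6 avg@16), rounded to one decimal place. The sketch below checks that reading against the table. The half-up rounding rule and the helper name `mean_of_avgs` are assumptions made for illustration, not something the table states.

```python
from decimal import Decimal, ROUND_HALF_UP

# (method, LCB v5 avg@8, LCB v6 avg@16, reported Avg.) copied from the table.
ROWS = [
    ("DeepSeek-R1-1.5B", "16.7", "17.2", "17.0"),
    ("DAPO", "26.0", "27.6", "26.8"),
    ("DeepCoder-1.5B", "23.3", "22.6", "23.0"),
    ("Nemotron-1.5B", "26.1", "29.5", "27.8"),
    ("Archer-Code-1.5B", "29.4", "30.2", "29.8"),
    ("Archer2.0-Code-1.5B-Preview", "31.5", "30.5", "31.0"),
]


def mean_of_avgs(v5_avg, v6_avg):
    """Mean of the two avg@k scores, rounded half-up to one decimal (assumed rule)."""
    mean = (Decimal(v5_avg) + Decimal(v6_avg)) / 2
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for method, v5_avg, v6_avg, reported in ROWS:
    computed = mean_of_avgs(v5_avg, v6_avg)
    status = "matches" if computed == Decimal(reported) else "differs"
    print(f"{method}: computed {computed}, reported {reported} ({status})")
```

Decimal arithmetic is used instead of floats so that values ending in 5 at the second decimal round predictably.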

Getting Started

1 Installation

# Create and activate a Python 3.10 environment.
conda create -n archer python=3.10 -y
conda activate archer

# Install dependencies.
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install --no-cache-dir flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

cd Archer2.0
pip install -e .

Initialize Ray Cluster

We have provided a one-click script to initialize Ray environments on any number of machines. Run the following command on the head node:

bash ./tools/start_ray.sh

Note:

Replace your_wandb_api_key in the line export WANDB_API_KEY=your_wandb_api_key with your actual key.
Hostfile locations vary across systems (for example, on our machine it is located at /etc/mpi/hostfile). Locate the file on your server and modify its content accordingly.
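
To illustrate what a hostfile typically holds, the sketch below parses one in a common MPI-style layout: one host per line, optionally followed by `slots=N`, with `#` starting a comment. This layout, the sample addresses, and the `parse_hostfile` helper are all assumptions; check the format that start_ray.sh actually expects on your cluster.

```python
def parse_hostfile(text):
    """Return (host, slots) pairs from MPI-style hostfile text (assumed layout)."""
    hosts = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        slots = None  # stays None when the line has no slots=N field
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts.append((fields[0], slots))
    return hosts


# Hypothetical hostfile content, for illustration only.
SAMPLE = """\
# head node first
10.0.0.1 slots=8
10.0.0.2 slots=8
"""

for host, slots in parse_hostfile(SAMPLE):
    print(host, slots)
```

Running a check like this before launching the cluster makes it easy to confirm that every expected node is listed.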

2 Training

We currently provide the script and data needed to reproduce the results of Archer2.0-Code-1.5B-Preview.

bash ./scripts/train/run_archer2.0_qwen2.5_1.5b_code.sh

3 Evaluation

When using the verl framework for RL training, we observed a consistent discrepancy between evaluation results obtained from the in-training weights and those obtained from the saved model checkpoints. To select checkpoints reliably, we run all evaluation on the saved checkpoints.

3.1 Automated Evaluation Pipeline

To automatically scan a specified directory and evaluate all saved model checkpoints during training, run the following script on a GPU-enabled machine:

bash ./tools/run_eval_pipeline.sh
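
Conceptually, the scan step amounts to listing checkpoint directories in training order. The sketch below assumes checkpoints are saved as subdirectories named `global_step_<N>`, a naming commonly seen with verl. The `find_checkpoints` helper is hypothetical, and run_eval_pipeline.sh may work differently.

```python
import re
from pathlib import Path

# Assumed checkpoint directory naming: global_step_<N>.
STEP_DIR = re.compile(r"global_step_(\d+)")


def find_checkpoints(root):
    """Return (step, path) pairs for checkpoint directories under root, sorted by step."""
    found = []
    for child in Path(root).iterdir():
        match = STEP_DIR.fullmatch(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort numerically so that step 100 comes after step 20.
    return sorted(found, key=lambda item: item[0])


# Replace "." with your checkpoint directory.
for step, path in find_checkpoints("."):
    print(step, path)
```

Sorting by the parsed integer rather than by name avoids the lexicographic ordering in which `global_step_100` would precede `global_step_20`.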

Since code evaluation tasks run on CPU only, we separate the LiveCodeBench evaluation to optimize GPU utilization. Execute the following script on a CPU machine to automatically evaluate the inference results generated in the previous step:

bash ./tools/run_lcb_eval.sh

3.2 Head-On Evaluation

Step 1: Convert Model Format

Run the following command to convert the model to Hugging Face format:

bash ./tools/model_merge.sh

Step 2: Run Inference

Execute the script below to generate inference results for the test data:

bash ./scripts/eval/run_eval.sh

Step 3: Run Evaluation

Go to line 245 of LiveCodeBench/lcb_runner/evaluation/compute_code_generation_metrics_v5.py and update the parquet_file path to point to the result file generated in Step 2.

Execute the following script to evaluate performance on the LiveCodeBench v5 benchmark:

python LiveCodeBench/lcb_runner/evaluation/compute_code_generation_metrics_v5.py

Note: Please update the path parameters in the scripts above as needed.
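
For orientation, the avg@k and pass@k metrics reported in the results table are commonly computed per problem as shown below and then averaged over all problems. This is a minimal sketch using the widely used unbiased pass@k estimator; the function names are illustrative, and the LiveCodeBench script's own implementation may differ in detail.

```python
from math import comb


def avg_at_k(correct_flags):
    """Fraction of the k sampled solutions that pass all tests for one problem."""
    return sum(correct_flags) / len(correct_flags)


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem, 8 samples, 2 of them correct.
flags = [True, False, False, True, False, False, False, False]
print(avg_at_k(flags))
print(pass_at_k(n=8, c=sum(flags), k=8))
```

When n equals k, as with pass@8 over 8 samples, the estimate reduces to 1 if any sample is correct and 0 otherwise.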

Technical Report

ASPO: Asymmetric Importance Sampling Policy Optimization

Acknowledgements

We build our model upon DeepSeek-R1-Distill-Qwen-1.5B.
Training was carried out with a modified version of verl.

Citation

Please cite the following:

@article{wang2025aspo,
  title={{ASPO}: Asymmetric Importance Sampling Policy Optimization},
  author={Wang, Jiakang and Liu, Runze and Lin, Lei and Hu, Wenping and Li, Xiu and Zhang, Fuzheng and Zhou, Guorui and Gai, Kun},
  journal={arXiv preprint arXiv:2510.06062},
  year={2025}
}

@article{wang2025stabilizing,
  title={Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR},
  author={Wang, Jiakang and Liu, Runze and Zhang, Fuzheng and Li, Xiu and Zhou, Guorui},
  journal={arXiv preprint arXiv:2507.15778},
  year={2025}
}
