We introduce P2R, a two-stage visual reasoning framework that explicitly decouples perception from reasoning.
We train P2R with PRA-GRPO, a role-aware alternating RL strategy that converts final-answer correctness into stage-specific supervision.
P2R consistently outperforms its VLM baselines on both high-resolution fine-grained benchmarks and general multimodal reasoning tasks.
git clone git@github.com:ZJU-REAL/Perceive-to-Reason.git
cd Perceive-to-Reason
conda create -n perceive-to-reason python=3.10 -y
conda activate perceive-to-reason
bash install.sh
Training Data
Download the training dataset from P2R-10k and place it under your data directory.
Evaluation Data
Download the following datasets and place them under your data directory.
PRA-GRPO alternates between two stages. Each stage keeps the other role frozen as an inference service and also requires a verifier service for the open-ended QA reward.
Note: Before training, configure the service IP addresses and ports in the training scripts (REASONER_HOST, PERCEIVER_HOST, VERIFIER_HOST, and their corresponding _PORT variables). Ensure each service uses a different port to avoid conflicts.
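As a rough illustration of the port requirement in the note above, the sketch below sets the three services' variables and checks that no port is reused. The addresses and port numbers are hypothetical placeholders, the names REASONER_PORT, PERCEIVER_PORT, and VERIFIER_PORT are assumed from the "_PORT" naming mentioned above, and the helper find_port_conflicts is not part of the repository; the actual variable layout in the training scripts may differ.

```shell
#!/usr/bin/env bash
# Hypothetical example values; substitute the addresses and ports of your own services.
REASONER_HOST="127.0.0.1"
REASONER_PORT=8001
PERCEIVER_HOST="127.0.0.1"
PERCEIVER_PORT=8002
VERIFIER_HOST="127.0.0.1"
VERIFIER_PORT=8003

# Illustrative helper (not from the repository): print every port that is
# passed more than once. Empty output means all ports are distinct.
find_port_conflicts() {
    printf '%s\n' "$@" | sort | uniq -d
}

conflicts=$(find_port_conflicts "$REASONER_PORT" "$PERCEIVER_PORT" "$VERIFIER_PORT")

if [ -n "$conflicts" ]; then
    echo "Port(s) used by more than one service: $conflicts" >&2
else
    echo "All service ports are distinct."
fi
```

Running a check like this before starting the servers may catch a copy-paste mistake early, since two services bound to the same port would otherwise fail only at startup.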
Stage 1: Train Perceiver
Start the frozen reasoner and verifier services:
bash scripts/start_reasoner_server.sh
bash scripts/start_verifier_server.sh
Then launch perceiver training:
bash example/qwen3_vl_4b_p2r/run_pra_grpo_perceiver.sh
Stage 2: Train Reasoner
Start the trained perceiver and verifier services:
bash scripts/start_perceiver_server.sh
bash scripts/start_verifier_server.sh
Then launch reasoner training:
bash example/qwen3_vl_4b_p2r/run_pra_grpo_reasoner.sh
Edit evaluation/run_eval_batch.sh to specify your model, mode, and task:
MODEL_NAMES=("your_model")
EVAL_MODE="p2r" # "default" | "thinking" | "p2r"
TASKS=("V-Star") # "V-Star" | "HR-Bench" | "MME-RealWorld-lite" | "MME-RealWorld"
Then run:
cd evaluation
bash run_eval_batch.sh
This project builds on veRL. Training data is sourced from DeepEyes, Mini-o3, and Zooming-without-Zooming. We thank the authors of those projects.
If you find Perceive-to-Reason useful, please consider citing our work:
@misc{li2026perceivetoreasondecouplingperceptionreasoning,
title={Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning},
author={Hongxing Li and Xiufeng Huang and Dingming Li and Wenjing Jiang and Zixuan Wang and Haolei Xu and Hanrong Zhang and Haiwen Hong and Longtao Huang and Hui Xue and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
year={2026},
eprint={2607.01191},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2607.01191},
}


