Official implementation of the paper PEAR: Phase Entropy Aware Reward for Efficient Reasoning.
We introduce Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase while allowing moderate exploration in the final answer phase. This encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly.
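The idea can be sketched in a few lines. The toy below is illustrative only, not the paper's exact reward: the function name, the mean-entropy penalty, and the fixed phase boundary `answer_start` are all assumptions. Mean entropy accumulated in the thinking phase reduces the reward, while answer-phase entropy is left unpenalized.

```python
def phase_entropy_reward(entropies, answer_start, alpha=1.0, base_score=1.0):
    """Toy phase-dependent reward sketch (not the paper's exact formula).

    entropies: per-token entropy of the generated sequence
    answer_start: index where the final answer phase begins (assumed known)
    alpha: entropy balance coefficient
    """
    thinking = entropies[:answer_start]  # only thinking-phase tokens are penalized
    if not thinking:
        return base_score
    penalty = alpha * sum(thinking) / len(thinking)  # mean thinking-phase entropy
    return base_score - penalty
```

A longer, higher-entropy thinking phase lowers the reward, so the policy is pushed toward shorter, more decisive reasoning traces.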
This project is built upon the veRL framework, an open-source toolkit for reinforcement learning.
Installation Steps:

- Install the veRL framework following the official documentation.
- Clone this repository:

```shell
git clone https://github.com/iNLP-Lab/PEAR.git
cd PEAR
```
We use the GSM8K training set as our default dataset for mathematical reasoning tasks. You can easily adapt PEAR to other reasoning datasets by modifying the data loading configuration.
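For reference, GSM8K stores its gold answer at the end of the solution text after a `####` marker, so answer extraction is a one-liner. The helper below is an illustrative sketch (the function name is ours); adapt it to the answer format of whatever dataset you swap in.

```python
def extract_gsm8k_answer(answer_field: str) -> str:
    """GSM8K gold solutions end with '#### <number>'; return that number.

    Commas inside large numbers (e.g. '1,200') are stripped so the result
    can be compared directly against a model's numeric answer.
    """
    return answer_field.split("####")[-1].strip().replace(",", "")
```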
To use the PEAR reward in your training pipeline:

```python
from entropy import compute_score

# Compute the reward for a generated response
reward = compute_score(
    solution_str=generated_text,
    ground_truth=correct_answer,
    old_log_prob={"old_log_probs": log_probs, "entropys": entropies},
    valid_response_ids=token_ids,
    tokenizer=tokenizer,
    method='strict',   # or 'flexible'
    score=1.0,         # base score
    format_score=0.0   # format penalty
)
```

Follow these steps to integrate PEAR into the veRL framework:
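The `entropys` values passed through `old_log_prob` are per-token entropies of the policy's next-token distribution; in veRL they are produced from the policy's logits during rollout. As a point of reference, the quantity for a single token is the Shannon entropy of its probability distribution (pure-Python sketch, illustrative only):

```python
import math

def token_entropy(probs):
    """Shannon entropy H = -sum(p * log p) of one token's distribution.

    probs: next-token probabilities for a single position (sums to 1).
    Zero-probability entries are skipped since lim p->0 of p*log p is 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

High entropy means the policy is uncertain at that position; PEAR penalizes this during the thinking phase.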
- Set up the veRL environment

  Ensure you have a working veRL installation (see Installation Steps above).

- Install the PEAR reward module

  Copy the PEAR reward calculation file into the veRL utils directory:

  ```shell
  cp entropy.py /path/to/verl/verl/utils/reward_score/
  ```

- Update the veRL configuration

  Replace the `__init__.py` file so that it imports the PEAR reward:

  ```shell
  cp __init__.py /path/to/verl/verl/utils/reward_score/__init__.py
  ```

- Integrate the reward manager

  Replace the reward manager so that entropy values are passed to the scoring function:

  ```shell
  cp naive.py /path/to/verl/verl/workers/reward_manager/naive.py
  ```

- Update the ray trainer

  Replace the ray trainer; the main change is to gather `old_log_prob` and entropy before computing the reward:

  ```shell
  cp ray_trainer.py /path/to/verl/verl/trainer/ppo/ray_trainer.py
  ```
- Configure training parameters

  Modify the training script with your desired hyperparameters. We use the following settings in our paper:

  - $\alpha = 1$ (entropy balance coefficient)
  - $s = 1$ (base score)
  - $r_{fmt} = 0$ (format penalty)

- Launch training

  See the veRL GRPO example and adjust the settings to your hardware configuration.
If you find this repo useful, please cite:
```bibtex
@article{huang2025pear,
  title={PEAR: Phase Entropy Aware Reward for Efficient Reasoning},
  author={Huang, Chen and Lu, Wei and Zhang, Wenxuan},
  journal={arXiv preprint arXiv:2510.08026},
  year={2025}
}
```