Official implementation of the paper PEAR: Phase Entropy Aware Reward for Efficient Reasoning.
We introduce Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase while allowing moderate exploration in the final answer phase. This encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly.
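The idea can be sketched in a few lines. The toy below is illustrative only, not the paper's exact reward: the function name, the mean-entropy penalty, and the fixed phase boundary `answer_start` are all assumptions. Mean entropy accumulated in the thinking phase reduces the reward, while answer-phase entropy is left unpenalized.

```python
def phase_entropy_reward(entropies, answer_start, alpha=1.0, base_score=1.0):
    """Toy phase-dependent reward sketch (not the paper's exact formula).

    entropies: per-token entropy of the generated sequence
    answer_start: index where the final answer phase begins (assumed known)
    alpha: entropy balance coefficient
    """
    thinking = entropies[:answer_start]  # only thinking-phase tokens are penalized
    if not thinking:
        return base_score
    penalty = alpha * sum(thinking) / len(thinking)  # mean thinking-phase entropy
    return base_score - penalty
```

A longer, higher-entropy thinking phase lowers the reward, so the policy is pushed toward shorter, more decisive reasoning traces.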
This project is built upon the veRL framework, an open-source toolkit for reinforcement learning.
Installation Steps:

- Install the veRL framework following the official documentation.
- Clone this repository:

```shell
git clone https://github.com/iNLP-Lab/PEAR.git
cd PEAR
```
We use the GSM8K training set as our default dataset for mathematical reasoning tasks. You can easily adapt PEAR to other reasoning datasets by modifying the data loading configuration.
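For reference, GSM8K stores its gold answer at the end of the solution text after a `####` marker, so answer extraction is a one-liner. The helper below is an illustrative sketch (the function name is ours); adapt it to the answer format of whatever dataset you swap in.

```python
def extract_gsm8k_answer(answer_field: str) -> str:
    """GSM8K gold solutions end with '#### <number>'; return that number.

    Commas inside large numbers (e.g. '1,200') are stripped so the result
    can be compared directly against a model's numeric answer.
    """
    return answer_field.split("####")[-1].strip().replace(",", "")
```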
To use the PEAR reward in your training pipeline:

```python
from entropy import compute_score

# Compute the reward for a generated response
reward = compute_score(
    solution_str=generated_text,
    ground_truth=correct_answer,
    old_log_prob={"old_log_probs": log_probs, "entropys": entropies},
    valid_response_ids=token_ids,
    tokenizer=tokenizer,
    method='strict',   # or 'flexible'
    score=1.0,         # base score
    format_score=0.0   # format penalty
)
```

Follow these steps to integrate PEAR into the veRL framework:
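The `entropys` values passed through `old_log_prob` are per-token entropies of the policy's next-token distribution; in veRL they are produced from the policy's logits during rollout. As a point of reference, the quantity for a single token is the Shannon entropy of its probability distribution (pure-Python sketch, illustrative only):

```python
import math

def token_entropy(probs):
    """Shannon entropy H = -sum(p * log p) of one token's distribution.

    probs: next-token probabilities for a single position (sums to 1).
    Zero-probability entries are skipped since lim p->0 of p*log p is 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

High entropy means the policy is uncertain at that position; PEAR penalizes this during the thinking phase.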
- Set up the veRL environment

  Ensure you have a working veRL installation (see Installation Steps above).

- Install the PEAR reward module

  Copy the PEAR reward calculation file into the veRL utils directory:

  ```shell
  cp entropy.py /path/to/verl/verl/utils/reward_score/
  ```

- Update the veRL configuration

  Replace the `__init__.py` file so that it imports the PEAR reward:

  ```shell
  cp __init__.py /path/to/verl/verl/utils/reward_score/__init__.py
  ```

- Integrate the reward manager

  Replace the reward manager so that entropy values are passed to the scoring function:

  ```shell
  cp naive.py /path/to/verl/verl/workers/reward_manager/naive.py
  ```

- Update the ray trainer

  Replace the ray trainer; the main change is to gather `old_log_prob` and entropy before computing the reward:

  ```shell
  cp ray_trainer.py /path/to/verl/verl/trainer/ppo/ray_trainer.py
  ```
- Configure training parameters

  Modify the training script with your desired hyperparameters. We use the following settings in our paper:

  - $\alpha = 1$ (entropy balance coefficient)
  - $s = 1$ (base score)
  - $r_{fmt} = 0$ (format penalty)

- Launch training

  See the veRL GRPO example and adjust the settings to your hardware configuration.
If you find this repo useful, please cite:
```bibtex
@article{huang2025pear,
  title={PEAR: Phase Entropy Aware Reward for Efficient Reasoning},
  author={Huang, Chen and Lu, Wei and Zhang, Wenxuan},
  journal={arXiv preprint arXiv:2510.08026},
  year={2025}
}
```