PEAR: Phase Entropy Aware Reward for Efficient Reasoning

Official implementation of the paper PEAR: Phase Entropy Aware Reward for Efficient Reasoning.

Overview

We introduce Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase while allowing moderate exploration in the final answer phase. This encourages models to generate concise reasoning traces that retain enough flexibility to solve the task correctly.
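The phase-dependent idea above can be illustrated with a toy sketch. This is not the official implementation: the exact functional form, the phase split, and the role of `alpha` are assumptions made here for illustration, loosely matching the hyperparameters listed later in this README (base score $s$, coefficient $\alpha$).

```python
def pear_reward(entropies, split_idx, correct, alpha=1.0, base_score=1.0):
    """Toy phase-entropy-aware reward (illustrative only).

    entropies: per-token policy entropies for the generated response.
    split_idx: index of the first token of the final answer phase;
               tokens before it belong to the thinking phase.
    correct:   whether the final answer matches the ground truth.
    """
    think = entropies[:split_idx]
    answer = entropies[split_idx:]
    mean_think = sum(think) / len(think) if think else 0.0
    mean_answer = sum(answer) / len(answer) if answer else 0.0
    correctness = base_score if correct else 0.0
    # Penalize thinking-phase entropy relative to answer-phase entropy:
    # verbose, high-entropy reasoning is discouraged, while moderate
    # exploration at the answer phase is tolerated.
    return correctness - alpha * (mean_think - mean_answer)
```

A response with high thinking-phase entropy and low answer-phase entropy scores lower than one with the opposite profile, nudging the model toward concise reasoning.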

Installation

This project is built upon the veRL framework, an open-source toolkit for reinforcement learning.

Installation Steps:

  1. Install the veRL framework following the official documentation

  2. Clone this repository:

    git clone https://github.com/iNLP-Lab/PEAR.git
    cd PEAR

Dataset

We use the GSM8K training set as our default dataset for mathematical reasoning tasks. You can easily adapt PEAR to other reasoning datasets by modifying the data loading configuration.
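As a starting point for adapting other datasets, the sketch below converts a GSM8K-style question/answer pair into the record layout commonly used by veRL parquet datasets. The field names (`data_source`, `prompt`, `reward_model`, etc.) are assumptions based on the public veRL data-preparation examples, not taken from this repository.

```python
def to_verl_record(question: str, answer: str, idx: int) -> dict:
    """Map one QA pair to a veRL-style training record (field names assumed)."""
    return {
        "data_source": "openai/gsm8k",
        # veRL prompts are chat-format message lists.
        "prompt": [{"role": "user", "content": question}],
        "ability": "math",
        # Rule-based reward: the scorer compares against ground_truth.
        "reward_model": {"style": "rule", "ground_truth": answer},
        "extra_info": {"split": "train", "index": idx},
    }
```

Rows built this way can be collected into a `pandas.DataFrame` and written out with `to_parquet` for veRL's data loader.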

Usage

Quick Start

To use PEAR reward in your training pipeline:

from entropy import compute_score

# Compute reward for a generated response
reward = compute_score(
    solution_str=generated_text,
    ground_truth=correct_answer,
    old_log_prob={"old_log_probs": log_probs, "entropys": entropies},
    valid_response_ids=token_ids,
    tokenizer=tokenizer,
    method='strict',  # or 'flexible'
    score=1.0,
    format_score=0.0
)

Training

Follow these steps to integrate PEAR into the veRL framework:

  1. Set up veRL environment

    Ensure you have a working veRL installation (see Installation).

  2. Install PEAR reward module

    Copy the PEAR reward calculation file to the veRL utils directory:

    cp entropy.py /path/to/verl/verl/utils/reward_score/
  3. Update veRL configuration

    Replace the __init__.py file to import the PEAR reward:

    cp __init__.py /path/to/verl/verl/utils/reward_score/__init__.py
  4. Integrate reward manager

    Replace the reward manager to pass entropy values to the scoring function:

    cp naive.py /path/to/verl/verl/workers/reward_manager/naive.py
  5. Update ray_trainer

    Replace the ray trainer; the main change here is to collect old_log_prob and entropy before computing the reward:

    cp ray_trainer.py /path/to/verl/verl/trainer/ppo/ray_trainer.py
  6. Configure training parameters

    Modify the training script with your desired hyperparameters. We use the following settings in our paper:

    • $\alpha = 1$ (entropy balance coefficient)
    • Base score $s = 1$
    • Format penalty $r_{fmt} = 0$
  7. Launch training

    See the veRL GRPO example and adjust settings according to your hardware configuration.
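The file-copy steps above (2–5) can be collected into one script. `VERL_ROOT` is a placeholder for your actual veRL checkout, matching the `/path/to/verl` used in the steps.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder: point this at your veRL checkout.
VERL_ROOT=/path/to/verl

# Steps 2-5: drop the PEAR files into the corresponding veRL locations.
cp entropy.py      "$VERL_ROOT/verl/utils/reward_score/"
cp __init__.py     "$VERL_ROOT/verl/utils/reward_score/__init__.py"
cp naive.py        "$VERL_ROOT/verl/workers/reward_manager/naive.py"
cp ray_trainer.py  "$VERL_ROOT/verl/trainer/ppo/ray_trainer.py"
```

Run it from the root of this repository so the four PEAR files resolve correctly.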

Citation

If you find this repo useful, please cite:

@article{huang2025pear,
  title={PEAR: Phase Entropy Aware Reward for Efficient Reasoning},
  author={Huang, Chen and Lu, Wei and Zhang, Wenxuan},
  journal={arXiv preprint arXiv:2510.08026},
  year={2025}
}

About

[ICLR 2026] Open repository for paper "PEAR: Phase Entropy Aware Reward for Efficient Reasoning"
