hskhalaf/hedging

Inference-Time Reward Hacking in Large Language Models

Code and visualizations for the paper "Inference-Time Reward Hacking in Large Language Models".

Overview

This repository contains implementations and demonstrations of inference-time reward hacking in LLMs and methods to mitigate it, including:

  • Best-of-n (BoN)
  • Soft Best-of-n (SBoN)
  • Best-of-Poisson (BoP) [Our Method]
  • HedgeTune algorithm for finding optimal inference parameters
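
The three sampling schemes listed above can be sketched in a few lines. This is a minimal illustration, not the code in src/. It assumes the commonly used formulations: BoN returns the highest-reward sample out of n, SBoN picks one of n samples with probability proportional to exp(reward / temperature), and BoP runs BoN with n drawn as one plus a Poisson(mu) variable. The function names, the `sample` and `reward` callables, and the `temperature` and `mu` parameters are hypothetical names chosen for this example; consult the paper for the exact definitions.

```python
import math
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[random.Random], T],
              reward: Callable[[T], float],
              n: int,
              rng: random.Random) -> T:
    """Draw n candidates and return the one with the highest proxy reward."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


def soft_best_of_n(sample: Callable[[random.Random], T],
                   reward: Callable[[T], float],
                   n: int,
                   temperature: float,
                   rng: random.Random) -> T:
    """Draw n candidates and pick one with probability ∝ exp(reward / temperature).

    Assumed formulation: a small temperature approaches BoN, a large one
    approaches sampling uniformly among the n candidates.
    """
    candidates = [sample(rng) for _ in range(n)]
    rewards = [reward(c) for c in candidates]
    top = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - top) / temperature) for r in rewards]
    return rng.choices(candidates, weights=weights, k=1)[0]


def poisson(mu: float, rng: random.Random) -> int:
    """Knuth's multiplication sampler for Poisson(mu); suitable for small mu."""
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def best_of_poisson(sample: Callable[[random.Random], T],
                    reward: Callable[[T], float],
                    mu: float,
                    rng: random.Random) -> T:
    """Run BoN with a random n = 1 + Poisson(mu) (assumed formulation)."""
    n = poisson(mu, rng) + 1  # the +1 guarantees at least one candidate
    return best_of_n(sample, reward, n, rng)
```

In practice, `sample` would wrap a call to a language model and `reward` would wrap a proxy reward model. Larger n (or mu) and smaller temperature push harder against the proxy reward, which is the regime where reward hacking becomes a concern.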

Structure

  • src/: Implementation of methods and experiments
  • visuals/: Interactive visualizations and demos
  • docs/: Paper and additional documentation

Paper

For more details, check out our paper on arXiv.
