Code and visualizations for the paper "Inference-Time Reward Hacking in Large Language Models".
This repository contains implementations and demonstrations of inference-time reward hacking in LLMs and methods to mitigate it, including:
- Best-of-n (BoN)
- Soft Best-of-n (SBoN)
- Best-of-Poisson (BoP) [Our Method]
- HedgeTune algorithm for finding optimal inference parameters
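
The three sampling schemes above can be illustrated with a minimal sketch. This is not the code in src/: the function names, the `generate` and `reward` callables, and the `temperature` and `mean` parameters are hypothetical. It assumes that SBoN picks among n candidates with probability proportional to exp(reward / temperature), and that BoP draws the number of candidates as 1 plus a Poisson sample so that at least one candidate always exists. See the paper for the exact definitions.

```python
import math
import random


def best_of_n(candidates, reward):
    """Best-of-n: return the candidate with the highest proxy reward."""
    return max(candidates, key=reward)


def soft_best_of_n(candidates, reward, temperature, rng=random):
    """Soft Best-of-n (assumed form): sample one candidate with
    probability proportional to exp(reward / temperature)."""
    scores = [reward(c) / temperature for c in candidates]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sample_poisson(mean, rng=random):
    """Draw a Poisson sample with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def best_of_poisson(generate, reward, mean, rng=random):
    """Best-of-Poisson (assumed form): draw n = 1 + Poisson(mean),
    generate n candidates, and return the one with the highest reward."""
    n = 1 + sample_poisson(mean, rng)
    return best_of_n([generate() for _ in range(n)], reward)
```

In this sketch, a small `temperature` makes `soft_best_of_n` behave like `best_of_n`, while a large one approaches uniform sampling over the candidates. That is the knob that trades proxy reward against staying close to the base model.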

Repository structure:
- src/: Implementation of methods and experiments
- visuals/: Interactive visualizations and demos
- docs/: Paper and additional documentation

For more details, check out our paper on arXiv.