Inference-Time Reward Hacking in Large Language Models

Code and visualizations for the paper "Inference-Time Reward Hacking in Large Language Models".

Overview

This repository contains implementations and demonstrations of inference-time reward hacking in LLMs, along with methods to mitigate it.

For more details, check out our paper on arXiv.
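
To make the idea concrete, here is a minimal toy sketch of how reward hacking can show up at inference time under best-of-n selection. It is not taken from this repository's `src` code. Everything in it is an assumption made for illustration: "responses" are single numbers drawn from a standard normal, `proxy_reward` and `true_reward` are hypothetical functions, and the quadratic penalty in `true_reward` is simply a convenient way to make the proxy and the true objective disagree for extreme candidates.

```python
import random


def true_reward(x: float) -> float:
    # Hypothetical "gold" reward: improves up to x = 1, then degrades.
    return x - 0.5 * x * x


def proxy_reward(x: float) -> float:
    # Hypothetical proxy reward: keeps increasing with x, so it agrees
    # with the true reward for small x and disagrees for large x.
    return x


def best_of_n(n: int, rng: random.Random) -> float:
    # Draw n toy "responses" and keep the one the proxy scores highest.
    candidates = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return max(candidates, key=proxy_reward)


def average_rewards(n: int, trials: int = 2000, seed: int = 0):
    # Average proxy and true reward of the selected response over many trials.
    rng = random.Random(seed)
    proxy_total = 0.0
    true_total = 0.0
    for _ in range(trials):
        x = best_of_n(n, rng)
        proxy_total += proxy_reward(x)
        true_total += true_reward(x)
    return proxy_total / trials, true_total / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        p, t = average_rewards(n)
        print(f"n={n:>3}  proxy={p:+.3f}  true={t:+.3f}")
```

In this toy setting the average proxy reward keeps rising with n, while the average true reward first improves and then falls once selection starts favoring candidates the proxy overrates. That gap is the pattern the term "reward hacking" refers to here; the paper should be consulted for the actual definitions and mitigation methods.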

Repository contents

docs
src
visuals
.DS_Store
README.md
footer.html
header.html
index.html
inference-time-reward-hacking.pdf
main.html