Can we train models to have more monitorable CoT?
We introduce Counterfactual Simulation Training (CST) to improve CoT faithfulness and monitorability.
CST produces models that admit to reward hacking and to deferring too much to Stanford profs (@ChrisGPotts told me this is very dangerous).