I don't understand why "RLHF" even needs RL. The reward function is a learned neural network and thus white-box, so we could simply backpropagate through it using the straight-through estimator (or the Gumbel-softmax trick) to obtain a much better gradient.
(Context: my understanding comes from the InstructGPT paper.)
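
To make the proposal concrete, here is a minimal sketch of a straight-through Gumbel-softmax gradient in a toy setting. Everything in it is an assumption for illustration and not from the post or the InstructGPT paper: the policy emits a single categorical token, the "reward model" is a linear stand-in (`reward_weights`), and the gradient is derived by hand rather than with an autodiff library. All function and parameter names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    """Gumbel(0, 1) sample via inverse transform: -log(-log(U))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep U away from 0 and 1
    return -math.log(-math.log(u))


def st_gumbel_reward_grad(logits, reward_weights, tau, rng):
    """One straight-through Gumbel-softmax sample for a toy linear reward.

    Forward pass: a hard (one-hot) token is scored by the reward.
    Backward pass: the one-hot is treated as if it were the relaxed sample,
    so the gradient flows through softmax((logits + gumbel) / tau).
    Returns (token, reward, grad_wrt_logits).
    """
    perturbed = [(l + gumbel_noise(rng)) / tau for l in logits]
    soft = softmax(perturbed)

    # Forward: discrete token = argmax of the relaxed sample.
    token = max(range(len(soft)), key=soft.__getitem__)
    reward = reward_weights[token]

    # Backward: for a linear reward r(y) = w . y, dr/dy = w, and
    # d soft_i / d logit_j = soft_i * (delta_ij - soft_j) / tau,
    # which gives dr/d logit_j = soft_j * (w_j - sum_i soft_i * w_i) / tau.
    expected = sum(s * w for s, w in zip(soft, reward_weights))
    grad = [s * (w - expected) / tau for s, w in zip(soft, reward_weights)]
    return token, reward, grad


def ascend(logits, reward_weights, steps, lr, tau, seed):
    """Plain gradient ascent on the logits using the estimator above."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        _, _, grad = st_gumbel_reward_grad(logits, reward_weights, tau, rng)
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return logits
```

For example, `ascend([0.0, 0.0, 0.0], [0.0, 1.0, 0.2], steps=50, lr=0.5, tau=1.0, seed=0)` pushes the logit of the highest-reward token up and the lowest-reward token down without any score-function (REINFORCE-style) term. Note the estimator is biased, and a real reward model scores whole autoregressively sampled sequences rather than one token, which this sketch does not address.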