Excited to release PrefixRL, where we achieved what I thought was a contradiction: learning from off-policy data with purely on-policy updates. This avoids all the instabilities of off-policy RL.
I think this will let us reuse previous RL and sampling FLOPs much more