You think on-policy sampling gives the best reward models? Think again! 🔥
Our finding: Even with on-policy data, reward models misalign with policy optimization goals!
Introducing PILAF—strategic sampling that fixes this fundamentally. (1/11)
Yunzhen Feng
- 🔥 TLDR: Synthetic data is making waves on this week's Nature cover! 1. We've developed a general theoretical framework and validated it on Llama 2 7B. 2. Our findings reveal a troubling **scaling collapse** where synthetic data defies expected scaling laws! 1/N
- Replying to @dohmatobelvis: Why we refused to cite the paper, and why all PC members unanimously agreed citation was not required. @BlackHC @miniapeur @roydanroy @canaesseth @thegautamkamath @suchenzang @jeremyphoward @krismicinski @dileeplearning
- Check out our poster tomorrow at 10am at the ICLR Bidirectional Human-AI Alignment workshop! We cover how on-policy preference sampling can be biased and our optimal response sampling for human labeling. @NYUDataScience @AIatMeta @KempeLab @YaqiDuanPKU x.com/feeelix_feng/s…
- Check out our paper at ICML 2024 on how AI-generated data changes scaling laws! 1/N Scale is not all you need: Real data might help you. Synthetic data will eventually kill you! Our paper "A Tale of Tails: Model Collapse as a Change of Scaling Laws" has been accepted at #ICML2024 arxiv.org/abs/2402.07043 Joint work with @feeelix_feng (@nyuniversity),
- Replying to @feeelix_feng: Solution: Inject interpolations into on-policy data! PILAF's philosophy: 🔍 Sample responses via policy interpolation → balance optimism (explore better actions) and conservatism (anchor to reference policy) ➔ Generates more informative pref pairs for learning r̂ (5/11)
- Replying to @feeelix_feng: This work reframes pref data collection: on-policy data is not enough for RLHF. It's NOT just about "gathering more data" or even "gathering on-policy data". It's about strategically sampling data that maximally reduces reward model bias during policy evolution. 🧠 (10/11)
- Replying to @feeelix_feng, @KempeLab, and 5 others: Standard RLHF pipeline (repeated each iteration): 1️⃣ Collect pref data by sampling (y₁,y₂) from current policy π 2️⃣ Train reward model r̂ via MLE (assuming Bradley-Terry model with human value r*) 3️⃣ Optimize π with r̂ (3/11)
- Replying to @feeelix_feng: PILAF: policy-interpolated learning for aligned feedback. The sampling can be approximated as follows. For each prompt: with 50% prob, sample y₁, y₂ ~ π (current policy); with 50% prob, sample y₁ ~ (1+β)π - βπ_ref (optimistic) and y₂ ~ (1-β)π + βπ_ref (conservative) (6/11)
- Replying to @feeelix_feng: PILAF vs. baselines in iterative/online DPO: ✅ Higher reward with lower KL divergence ✅ Saves annotation + computation for similar performance ✅ No hyperparameters to tune! (8/11)
- Check out our new paper on how feedback enhances synthetic data generated by AI models! How to leverage AI-synthesized data without catastrophic degradation? Rank-and-prune feedback, from humans or even weaker models, provably restores and even surpasses original performance! See arxiv.org/abs/2406.07515 @AIatMeta @feeelix_feng @dohmatobelvis @f_charton @yangpuPKU
- Replying to @feeelix_feng: PILAF's principles apply to DPO, PPO, and beyond! For researchers working on: ✔️ Reward modeling theory ✔️ LLM alignment dynamics This paper offers new insights: What makes preference data effective for RLHF? 📄 (11/11)
- Replying to @feeelix_feng: Why does interpolation work? 1️⃣ Optimization: ➔ PILAF's gradient aligns with oracle reward r*'s policy gradient ➔ Ensures policy updates maximize r* (human values) 2️⃣ Statistical: ➔ Samples focus on high-sensitivity regions of r* ➔ Converges with consistently high rewards (7/11)
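
The sampling rule from the (6/11) post can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. It assumes a small categorical policy over a fixed set of candidate responses, and it reads the shorthand "(1+β)π - βπ_ref" as an interpolation of log-probabilities followed by renormalization. That reading is an assumption, since the thread gives only the shorthand. The function names `interpolate` and `sample_pilaf_pair` are hypothetical.

```python
import math
import random


def interpolate(pi, pi_ref, coef):
    """Return probabilities proportional to pi**(1 + coef) * pi_ref**(-coef).

    Assumption: the interpolation acts on log-probabilities.
    coef = +beta gives the "optimistic" policy, coef = -beta the
    "conservative" one, and coef = 0 recovers pi itself.
    """
    logits = [(1 + coef) * math.log(p) - coef * math.log(q)
              for p, q in zip(pi, pi_ref)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_pilaf_pair(pi, pi_ref, beta, rng):
    """Sample one (y1, y2) pair of response indices for a single prompt."""
    responses = list(range(len(pi)))
    if rng.random() < 0.5:
        # Plain on-policy pair: both responses come from the current policy.
        y1, y2 = rng.choices(responses, weights=pi, k=2)
        return y1, y2
    # Interpolated pair: one optimistic sample, one conservative sample.
    optimistic = interpolate(pi, pi_ref, beta)
    conservative = interpolate(pi, pi_ref, -beta)
    y1 = rng.choices(responses, weights=optimistic, k=1)[0]
    y2 = rng.choices(responses, weights=conservative, k=1)[0]
    return y1, y2


if __name__ == "__main__":
    rng = random.Random(0)
    pi = [0.5, 0.3, 0.2]      # toy current policy (made-up numbers)
    pi_ref = [0.2, 0.3, 0.5]  # toy reference policy (made-up numbers)
    print(interpolate(pi, pi_ref, 0.5))
    print(interpolate(pi, pi_ref, -0.5))
    print([sample_pilaf_pair(pi, pi_ref, 0.5, rng) for _ in range(5)])
```

In this toy setup the optimistic distribution shifts mass toward responses the current policy favors more than the reference does, and the conservative one pulls back toward the reference. With an actual language model the same idea would apply to token-level log-probabilities during generation, which this sketch does not attempt.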