Yunzhen Feng

136 posts

@feeelix_feng

PhD at CDS, NYU. Ex-Intern at GenAI, FAIR @AIatMeta. Previously undergrad at @PKU1898

Joined May 2022

Yunzhen Feng
@feeelix_feng
Feb 8, 2025
You think on-policy sampling gives the best reward models? Think again! 🔥 Our finding: Even with on-policy data, reward models misalign with policy optimization goals! Introducing PILAF—strategic sampling that fixes this fundamentally. (1/11)
29K
Yunzhen Feng
@feeelix_feng
Jul 26, 2024
🔥 TLDR: Synthetic data is making waves on this week's Nature cover!
1. We've developed a general theoretical framework and validated it on Llama 2 7B.
2. Our findings reveal a troubling **scaling collapse** where synthetic data defies expected scaling laws! 1/N
15K
Yunzhen Feng
@feeelix_feng
Apr 12, 2025
Replying to @dohmatobelvis
Why we refused to cite the paper—and why all PC members unanimously agreed citation was not required. @BlackHC @miniapeur @roydanroy @canaesseth @thegautamkamath @suchenzang @jeremyphoward @krismicinski @dileeplearning
19K
Yunzhen Feng
@feeelix_feng
Apr 27, 2025
Check out our poster tmr at 10am at the ICLR Bidirectional Human-AI Alignment workshop! We cover how on-policy preference sampling can be biased and our optimal response sampling for human labeling. @NYUDataScience @AIatMeta @KempeLab @YaqiDuanPKU x.com/feeelix_feng/s…
5.5K
Yunzhen Feng
@feeelix_feng
May 4, 2024
Check out our paper at ICML 2024 on how AI generated data changes scaling laws!
Elvis Dohmatob
@dohmatobelvis
May 3, 2024
1/N Scale is not all you need: Real data might help you. Synthetic data will eventually kill you! Our paper "Tail of Tales: Model Collapse as a Change of Scaling Laws" has been accepted at #ICML2024 arxiv.org/abs/2402.07043 Joint work with @feeelix_feng (@nyuniversity),
3.3K
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
Paper: arxiv.org/abs/2502.04270 🧵 @kempelab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience
arxiv.org
PILAF: Optimal Human Preference Sampling for Reward Modeling
As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key...
666
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
Solution: Inject interpolations into on-policy data!
PILAF's philosophy:
🔍 Sample responses via policy interpolation → balance optimism (explore better actions) and conservatism (anchor to reference policy)
➔ Generates more informative pref pairs for learning r̂ (5/11)
393
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
This work reframes pref data collection: on-policy data is not enough for RLHF. It's NOT just about "gathering more data" or even "gathering on-policy data" – it's about strategically sampling data that maximally reduces reward model bias during policy evolution. 🧠(10/11)
507
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng @KempeLab and 5 others
Standard RLHF pipeline (repeated each iteration):
1️⃣ Collect pref data by sampling (y₁, y₂) from current policy π
2️⃣ Train reward model r̂ via MLE (assuming Bradley-Terry model with human value r*)
3️⃣ Optimize π with r̂ (3/11)
615
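Step 2️⃣ of the pipeline above can be made concrete with a small sketch. This is an illustration only, not code from the paper: it assumes a tabular reward (one scalar per response) and plain gradient descent, and the names `bt_prob`, `bt_nll`, and `fit_rewards` are hypothetical. Under the Bradley-Terry model, the probability that response a beats response b is sigmoid(r(a) - r(b)), and the reward model is fit by minimizing the negative log-likelihood of the observed preferences.

```python
import math


def bt_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def bt_nll(rewards, prefs):
    """Average negative log-likelihood; prefs holds (winner, loser) index pairs."""
    total = sum(math.log(bt_prob(rewards[w], rewards[l])) for w, l in prefs)
    return -total / len(prefs)


def fit_rewards(n_items, prefs, lr=0.5, steps=500):
    """Fit a tabular reward estimate by gradient descent on the BT NLL.

    Hypothetical helper: a real reward model would be a neural network,
    but the objective has the same form.
    """
    r = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in prefs:
            p = bt_prob(r[w], r[l])
            # d/dr_w of -log(sigmoid(r_w - r_l)) is -(1 - p); d/dr_l is +(1 - p)
            grad[w] -= 1.0 - p
            grad[l] += 1.0 - p
        r = [ri - lr * g / len(prefs) for ri, g in zip(r, grad)]
    return r


# Usage: response 0 wins three of four comparisons against response 1,
# so the fitted reward gap approaches log(3), the MLE for a 3:1 win ratio.
prefs = [(0, 1)] * 3 + [(1, 0)]
r_hat = fit_rewards(2, prefs)
print(r_hat[0] - r_hat[1])
```

Only reward differences are identified by this objective, which is why the sketch reports the gap rather than the individual values.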
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
PILAF: policy-interpolated learning for aligned feedback
An approximation of sampling is:
For each prompt,
With 50% prob: sample y₁, y₂ ~ π (current policy)
With 50% prob: sample
y₁ ~ (1+β)π - βπ_ref (optimistic)
y₂ ~ (1-β)π + βπ_ref (conservative) (6/11)
337
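The sampling scheme above can be sketched over a small discrete set of responses. This is a rough illustration under stated assumptions, not the authors' implementation: the post's shorthand "(1+β)π - βπ_ref" is read here as a combination of log-probabilities followed by renormalization (a literal mixture of probabilities with a negative weight would not be a valid distribution), all probabilities are assumed strictly positive, and the names `interpolate` and `sample_pair` are hypothetical.

```python
import math
import random


def interpolate(pi, pi_ref, coeff):
    """Distribution proportional to pi^(1 + coeff) * pi_ref^(-coeff).

    Assumption: interpolation happens in log space.
    coeff = +beta gives the optimistic policy, coeff = -beta the conservative one.
    """
    log_w = [(1 + coeff) * math.log(p) - coeff * math.log(q)
             for p, q in zip(pi, pi_ref)]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(x - m) for x in log_w]
    z = sum(w)
    return [x / z for x in w]


def sample_pair(pi, pi_ref, beta, rng):
    """Draw one (y1, y2) pair of response indices for a single prompt."""
    idx = list(range(len(pi)))
    if rng.random() < 0.5:
        # Plain on-policy pair: both responses come from the current policy.
        y1, y2 = rng.choices(idx, weights=pi, k=2)
    else:
        # Interpolated pair: one optimistic draw, one conservative draw.
        y1 = rng.choices(idx, weights=interpolate(pi, pi_ref, beta))[0]
        y2 = rng.choices(idx, weights=interpolate(pi, pi_ref, -beta))[0]
    return y1, y2


# Usage with toy distributions over three candidate responses.
rng = random.Random(0)
pi = [0.7, 0.2, 0.1]      # current policy
pi_ref = [0.4, 0.4, 0.2]  # reference policy
pairs = [sample_pair(pi, pi_ref, beta=0.5, rng=rng) for _ in range(5)]
print(pairs)
```

In this toy setup the optimistic distribution puts more mass than π on responses the policy already favors relative to π_ref, and the conservative one pulls mass back toward π_ref.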
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
PILAF vs. baselines in iterative/online DPO:
✅ Higher reward with lower KL divergence
✅ Saves annotation + computation for similar performance
✅ No hyperparameters to tune! (8/11)
322
Yunzhen Feng
@feeelix_feng
Jun 12, 2024
Check out our new paper on how feedback enhances synthetic data generated by AI models!
Julia Kempe
@KempeLab
Jun 12, 2024
How to leverage AI-synthesized data without catastrophic degradation? Rank-and-prune feedback, from humans or even weaker models, provably restores and even surpasses original performance! See arxiv.org/abs/2406.07515 @AIatMeta @feeelix_feng @dohmatobelvis @f_charton @yangpuPKU
452
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
PILAF's principles apply to DPO, PPO, and beyond!
For researchers working on:
✔️ Reward modeling theory
✔️ LLM alignment dynamics
This paper offers new insights: What makes preference data effective for RLHF? 📄(11/11)
447
Yunzhen Feng
@feeelix_feng
Feb 8, 2025
Replying to @feeelix_feng
Why does interpolation work?
1️⃣ Optimization:
➔ PILAF's gradient aligns with oracle reward r*'s policy gradient
➔ Ensures policy updates maximize r* (human values)
2️⃣ Statistical:
➔ Samples focus on high-sensitivity regions of r*
➔ Converge with constantly high rewards (7/11)
334