RL with dense, token-level feedback just got a major upgrade.
Turns out, on-policy self-distillation (OPSD) mostly teaches LLMs to copy writing style—“Therefore”, LaTeX, assertive phrasing—rather than actual reasoning steps. This “privilege-induced style drift” can collapse
00:00

