1) Our team at Meta has a tough new coding benchmark that challenges models to code entire programs, including FFmpeg and the PHP compiler, from scratch.
2) Top accuracy is 0%.
3) We will be making the benchmark harder.
How much of SQLite, FFmpeg, or the PHP compiler can LMs code from scratch, given just an executable and no starter code or internet access?
Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵