Ofir Press
@OfirPress
I push the AI frontier by building tough benchmarks with amazing people. SWE-bench, SWE-agent, SciCode, AlgoTune. Postdoc @Princeton. PhD @nlpnoah @UW.
NYC
Joined June 2016
Posts
  • Pinned
    1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs including ffmpeg and the PHP compiler from scratch. 2) Top accuracy is 0% 3) We will be making the benchmark harder.
    How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
  • My entire feed is OpenAI employees retweeting Sam with the heart emoji. If the board doesn't let him back, he's going to start a new company and take a large chunk of those people with him. If the board does let him back, Ilya is going to leave and start a competitor. (1/2)
  • We've found a new way to prompt language models that improves their ability to answer complex questions. Our Self-ask prompt first has the model ask and answer simpler subquestions. This structure makes it easy to integrate Google Search into an LM. Watch our demo with GPT-3 🧵⬇️
  • There's no moat. You just need $400M and a bunch of good engineers and you can build your own GPT-4. Now we gotta get someone to build an open version.
  • I just discovered regional prompting for image generation and I'm so impressed (wait till the end). From: reddit.com/r/StableDiffus…
  • New (1h32m) video lecture: Transformers From Scratch: Building 5 Language Models at Increasing Complexity Levels youtu.be/s09NPN1BSdE It's an intuitive way to learn what every component of a modern transformer LM does and why it's there.
  • Cool new idea from DeepMind: They evaluate LMs by giving them a piece of code, having them describe it, and then asking the LM to rewrite that code given only the description. The metric is the similarity between the original code and the rewritten code. semanticscholar.org/paper/Unsuperv…
  • Can someone fix this table please? Satya should be at the top.
  • Replying to @OfirPress
    I'm sure this chaos and uncertainty sucks for all of those involved but if the world gets 2 strong competing LMing companies out of what used to be OpenAI, we'll all win... Especially if the Sam-led one ends up actually being a bit more open. (2/2)
  • Since Transformer LMs were invented, we’ve wanted them to be able to read longer inputs during inference than they saw during training. Our Attention with Linear Biases enables this, in very few lines of code, without requiring extra params or runtime ofir.io/train_short_te… 🧵⬇
  • As language models grow in size they know more, but do they get better at reasoning? To test GPT-3, we generated lots of questions such as "What is the calling code of the birthplace of Adele?". We show that as GPT size grows, it does not improve its compositional abilities. 🧵⬇️
  • Everyone thinks that you have to increase the input length of language models to improve their performance. Our new Shortformer model shows that by *shortening* inputs, performance improves while speed and memory efficiency go up. ⬇(1/n) ofir.io/shortformer.pdf (code below)
  • Transformers can work without using positional embeddings at all. Llama 4 uses positional embs for local attn but not globally. Our paper from 2022 shows why this works: the causal mask allows transformers to infer positions. arxiv.org/pdf/2203.16634
  • Reddit launched in 2005. StackOverflow in 2008. Both are shutting off access to their data because they're annoyed that they aren't getting paid when it gets used for LM training. Silly move: the value of future data is minuscule given that we already have data from 2008-now.