Why do we treat train and test times so differently?
Why is one “training” and the other “in-context learning”?
Just take a few gradients during test-time — a simple way to increase test time compute — and get a SoTA in ARC public validation set 61%=avg. human score! @arcprize
Ekin Akyürek
- How does in-context learning work? Maybe language models unexpectedly discover how to store/simulate/train other models in their hidden units. So, few-shot prompting can be equivalent to fine-tuning running inside of an LM! Could this be true in theory? (⚠no real LM)👇
- ✨ Big life updates ✨ - @afeyzaakyurek and I welcomed our baby! - Successfully defended my PhD and graduated from MIT 🎓 - Joined @OpenAI 🍓 Excited for what's next!
- I have some suggestions for friends using GPT for earthquake relief. As far as I can see, Davinci is being used, which makes for a slow and expensive application. I am sharing a script that extracts location and intent from tweets; you can process 1000 tweets in 1 minute for ~free. Details🧵
- Can insights from synthetic experiments and interpretability lead to real improvements in language modeling? We: > propose a formal model for in-context learning > uncover "n-gram heads" = high order induction heads, crucial for ICLL > improve Transformer LM perplexity by 6.7%
- I am on the front page of MIT today! I am grateful to MIT News for covering my research! You can read the full paper: arxiv.org/abs/2211.15661 I take the opportunity to support the people who suffered from the *unprecedented* earthquake in Turkiye. Trustworthy orgs to donate:
  Quoted post: Large language models like GPT-3 can learn new tasks without updating their parameters. A new study “could explain almost all of the learning phenomena that we have seen with these large models,” says Ekin Akyürek. mitsha.re/IjIl50MLXLi
  Link preview (arxiv.org): What learning algorithm is in-context learning? Investigations... Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented...
- At Google, I analyzed data-attribution methods, a.k.a. influence, for explaining language models’ predictions on factual queries. Back then, the ambiguity of the ground truth in the real training corpus limited my research, so we devised a clean benchmark! arxiv.org/abs/2205.11482
- Hard to disagree with @karpathy but I disagree on this particular quote: “You may have gone down the wrong alleys until you arrived at the right solution. Every single one of those incorrect things you did, as long as you got to the correct solution, will be upweighted as, ‘Do
- Data augmentation (reflection in CV, paraphrasing in NLP) improves generalization by encouraging models to learn symmetries of data distributions. We identify symmetries automatically (even in multi-modal tasks) from discrete representations of examples. arxiv.org/abs/2201.12926
- It meant a lot to me that OpenAI made the hallucination problem priority zero. I had a chance to contribute to these efforts a very tiny bit, thanks to @ericmitchellai @erinkav @yanndubs. The models aren’t perfect and hallucinations are hard to measure. But I will be surprised if
  Quoted post: GPT-5 is the first series of models that actually doesn’t hallucinate basically at all *real-world utility-maxxing instead of benchmark-maxxing intensifies* Disclaimer: GPT-5 is still not perfect and may make (far fewer now) mistakes
- There are three types of storage: activations (in-context), external memory, and model weights. If models are going to spend days on a task, then they should be really good at compiling their in-context work into an external memory or into their weights! Here we try to learn weights
  Quoted post: What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
- Thanks for the attention, a couple of important points: 1) See @MindsAI_Jack; their team was the first to apply this method privately, and they took 1st place in the competition. 2) See the concurrent work as well: x.com/ellisk_kellis/… 3) Obviously this is not AGI, it's a
- Our paper has been accepted to ICLR 2023 as notable-top-5% (oral). Hoping to present it in Rwanda in May and excited to discuss more on in-context learning ability of language models. #ICLR
- The right perspective is not increasing inference-time compute, it should be making train and inference times indistinguishable. Having a model that can go through these stages in and out like us











